Model Tuning Project 6: Credit Card Users Churn Prediction
To explore and conduct a statistical analysis of the Thera Bank customer dataset, extract insights using Exploratory Data Analysis (EDA), and build the best possible model with the performance necessary to help the bank improve its services so that customers do not cancel their credit cards
Provide the statistical analysis to complete the following tasks:
BankChurners.csv - The csv file contains the customer base for Thera Bank, including customer profiles and personal information regarding customer accounts, and contains the following fields:
# Import and Load Library Packages
# Import necessary and important packages and Libraries
import os # os package in case os functions are required - PF-W1-1.12
import numpy as np # For numerical array operations
import scipy # For numerical routines and statistics functions
import pandas as pd # Pandas library used for data manipulation and analysis
# import pandas_profiling # Needed for running Pandas Profiling Report
# from pandas_profiling import ProfileReport
import seaborn as sns # for visualisations (end a sns plot line with ';' to suppress text output)
import scipy.stats as stats # for probability distributions and statistical functions
import matplotlib.pyplot as plt # matplotlib plotting library for plots/visualisations (seaborn builds on it)
sns.set(color_codes=True) # Apply seaborn defaults; color_codes=True remaps shorthand colors ('b', 'g', ...) to the seaborn palette
sns.set_style('darkgrid') # Set parameter style to 'darkgrid'
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
# "%" Python to display seaborn graphs inJupyter notebook
# Python warning filter
import warnings # Must import 'warnings' to define filter in next line
warnings.filterwarnings("ignore") # Suppresses Python warning messages (shown in pink boxes in Jupyter)
# importing sklearn libraries
from sklearn.impute import SimpleImputer # Library to impute missing values
from sklearn.pipeline import Pipeline # Library for creating pipelines
from sklearn.preprocessing import StandardScaler, MinMaxScaler # Library for data scaling
from sklearn.model_selection import train_test_split # Library to split data
from sklearn import metrics # Library to apply different metric scores
from sklearn.svm import SVC # Library to apply support vector machines
# importing libraries for model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# importing libraries for applying different metrics
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold, cross_val_score # GridSearchCV already imported above
from sklearn.metrics import (
    f1_score,
    roc_auc_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    plot_confusion_matrix,
)
# importing libraries to oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Removes the limit from the number of displayed columns and rows. - SL-W2-2.2
# To see the entire dataframe set display to "None"
pd.set_option('display.max_columns', None) # None displays all columns
pd.set_option('display.max_rows', 200) # displays up to 200 rows, None displays all rows
# Normally this cell provides no feedback or output, unless there is an error
# The following code has been added to provide coder feedback that the cell ran properly
print('Import load complete') # Indicator when cell finished running
print('numpy version:\t',np.__version__) # Checks numpy version
print("pandas version:\t",pd.__version__) # Checks pandas version
print('seaborn version:',sns.__version__) # Checks seaborn version
print('scipy version:',scipy.__version__) # Checks scipy version
# print ('\n',os.getcwd()) # Used to display working directory path/name if needed
Import load complete numpy version: 1.20.3 pandas version: 1.3.1 seaborn version: 0.11.2 scipy version: 1.7.1
# Read the Dataset, Copy to Second Dataset, Display Dataset Head
data = pd.read_csv("BankChurners.csv") # Dataset loading from a csv file
df = data.copy() # Copying dataset to df dataframe
# the following print statements verify the CSV load/copy ran properly
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.') # f-string - SL-W2-2.2
display(df.head()) # Display first 5 rows of data
display(df.dtypes.value_counts()) # display value counts of data types
There are 10127 rows and 21 columns.
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
int64 10 object 6 float64 5 dtype: int64
df.info() # Provides detailed info of each column attribute
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
# Converting multiple columns with Object data types - SL-W2 MLS 10
for coltype in df.columns[df.dtypes == 'object']: # Apply for loop to all object columns
    df[coltype] = df[coltype].astype('category') # Apply conversion to category
print ("Object data types converted to Category data type") # to denote cell ran/finished
Object data types converted to Category data type
# searching and counting the number of duplicate records within the dataset
df.duplicated().sum()
0
No duplicate records were found within the dataset
Typically, client IDs are unique to each customer; they therefore provide no analytical value and can be removed from the dataset immediately
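As a minimal sketch with made-up toy data, an ID-like column can be detected mechanically: it is unique for every row, so it carries no predictive signal and can be dropped.

```python
import pandas as pd

# Toy frame: CLIENTNUM is unique per row (ID-like), Customer_Age is not
df = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008, 713982108],
    "Customer_Age": [45, 49, 45],
})
# A column whose nunique equals the row count is an identifier candidate
id_like = [c for c in df.columns if df[c].nunique() == len(df)]
df = df.drop(columns=id_like)
print(df.columns.tolist())  # -> ['Customer_Age']
```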
# Finding the Count of Duplicate Values in a Specific Column
df.duplicated(subset='CLIENTNUM').sum()
0
No duplicate entries within the CLIENTNUM column
# finding the number of unique entries within a single attribute
df.CLIENTNUM.nunique()
10127
All 10127 CLIENTNUM column entries are unique
Initial Observations and Insights of Dataset
CLIENTNUM attribute
# Dropping the CLIENTNUM attribute from the dataset
df.drop(['CLIENTNUM'], axis=1, inplace=True) # Drops the attribute column from the dataframe
# Checking for Missing Data within Dataset
# Verifying if the data has any missing entries before evaluating - Source MLS 4 Case Study
# isnull() returns per-column missing-value counts; all zeros means no missing data
print (df.isnull().sum().sort_values(ascending=False))
Education_Level 1519 Marital_Status 749 Attrition_Flag 0 Contacts_Count_12_mon 0 Total_Ct_Chng_Q4_Q1 0 Total_Trans_Ct 0 Total_Trans_Amt 0 Total_Amt_Chng_Q4_Q1 0 Avg_Open_To_Buy 0 Total_Revolving_Bal 0 Credit_Limit 0 Months_Inactive_12_mon 0 Customer_Age 0 Total_Relationship_Count 0 Months_on_book 0 Card_Category 0 Income_Category 0 Dependent_count 0 Gender 0 Avg_Utilization_Ratio 0 dtype: int64
# missing values by percentage in the dataset
print(df.isnull().sum()/ df.isnull().count() * 100)
Attrition_Flag 0.000000 Customer_Age 0.000000 Gender 0.000000 Dependent_count 0.000000 Education_Level 14.999506 Marital_Status 7.396070 Income_Category 0.000000 Card_Category 0.000000 Months_on_book 0.000000 Total_Relationship_Count 0.000000 Months_Inactive_12_mon 0.000000 Contacts_Count_12_mon 0.000000 Credit_Limit 0.000000 Total_Revolving_Bal 0.000000 Avg_Open_To_Buy 0.000000 Total_Amt_Chng_Q4_Q1 0.000000 Total_Trans_Amt 0.000000 Total_Trans_Ct 0.000000 Total_Ct_Chng_Q4_Q1 0.000000 Avg_Utilization_Ratio 0.000000 dtype: float64
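The percentage computation above can be written more compactly: since a boolean mask averages to the fraction of `True` values, `isnull().mean()` gives the missing fraction per column in one call. A small sketch on toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: two of four Education_Level entries are missing
df = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "College", np.nan],
    "Customer_Age": [45, 49, 51, 40],
})
# mean() of the boolean isnull() mask is the missing fraction per column
pct_missing = df.isnull().mean() * 100
print(pct_missing)  # Education_Level 50.0, Customer_Age 0.0
```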
Missing Data Results
# generate descriptive statistics that summarize the central tendency,
# dispersion and shape of the dataset’s distribution
round(df.describe().T,2) # descriptive summary of dataset
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.33 | 8.02 | 26.0 | 41.00 | 46.00 | 52.00 | 73.00 |
| Dependent_count | 10127.0 | 2.35 | 1.30 | 0.0 | 1.00 | 2.00 | 3.00 | 5.00 |
| Months_on_book | 10127.0 | 35.93 | 7.99 | 13.0 | 31.00 | 36.00 | 40.00 | 56.00 |
| Total_Relationship_Count | 10127.0 | 3.81 | 1.55 | 1.0 | 3.00 | 4.00 | 5.00 | 6.00 |
| Months_Inactive_12_mon | 10127.0 | 2.34 | 1.01 | 0.0 | 2.00 | 2.00 | 3.00 | 6.00 |
| Contacts_Count_12_mon | 10127.0 | 2.46 | 1.11 | 0.0 | 2.00 | 2.00 | 3.00 | 6.00 |
| Credit_Limit | 10127.0 | 8631.95 | 9088.78 | 1438.3 | 2555.00 | 4549.00 | 11067.50 | 34516.00 |
| Total_Revolving_Bal | 10127.0 | 1162.81 | 814.99 | 0.0 | 359.00 | 1276.00 | 1784.00 | 2517.00 |
| Avg_Open_To_Buy | 10127.0 | 7469.14 | 9090.69 | 3.0 | 1324.50 | 3474.00 | 9859.00 | 34516.00 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.76 | 0.22 | 0.0 | 0.63 | 0.74 | 0.86 | 3.40 |
| Total_Trans_Amt | 10127.0 | 4404.09 | 3397.13 | 510.0 | 2155.50 | 3899.00 | 4741.00 | 18484.00 |
| Total_Trans_Ct | 10127.0 | 64.86 | 23.47 | 10.0 | 45.00 | 67.00 | 81.00 | 139.00 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.71 | 0.24 | 0.0 | 0.58 | 0.70 | 0.82 | 3.71 |
| Avg_Utilization_Ratio | 10127.0 | 0.27 | 0.28 | 0.0 | 0.02 | 0.18 | 0.50 | 1.00 |
Observations and Insights
A breakdown of the attributes is as follows:
df.describe(include=['category']).T # descriptive summary of category type attributes
| count | unique | top | freq | |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
Observations and Insights
All attributes appear to have reasonable values and ranges for this portion of the analysis
A breakdown of the attributes is as follows:
print(df.nunique().sort_values(ascending=False))
Avg_Open_To_Buy 6813 Credit_Limit 6205 Total_Trans_Amt 5033 Total_Revolving_Bal 1974 Total_Amt_Chng_Q4_Q1 1158 Avg_Utilization_Ratio 964 Total_Ct_Chng_Q4_Q1 830 Total_Trans_Ct 126 Customer_Age 45 Months_on_book 44 Contacts_Count_12_mon 7 Months_Inactive_12_mon 7 Total_Relationship_Count 6 Income_Category 6 Education_Level 6 Dependent_count 6 Card_Category 4 Marital_Status 3 Gender 2 Attrition_Flag 2 dtype: int64
Next we will create an attribute list for each data type category
# create a int64 data type attribute list
int_cols = df.select_dtypes(['int64'])
int_cols.columns
Index(['Customer_Age', 'Dependent_count', 'Months_on_book',
'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Total_Revolving_Bal', 'Total_Trans_Amt',
'Total_Trans_Ct'],
dtype='object')
# Displaying unique value counts for int64 data types
for i in int_cols:
    print("int64 data type counts in column: ", i)
    print("Number of unique entries: ", df[i].nunique())
    print("Highest entry counts (15):")
    print(df[i].value_counts(ascending=False, dropna=False).iloc[:15])
    print('*'*50, '\n')
int64 data type counts in column: Customer_Age Number of unique entries: 45 Highest entry counts (15): 44 500 49 495 46 490 45 486 47 479 43 473 48 472 50 452 42 426 51 398 53 387 41 379 52 376 40 361 39 333 Name: Customer_Age, dtype: int64 ************************************************** int64 data type counts in column: Dependent_count Number of unique entries: 6 Highest entry counts (15): 3 2732 2 2655 1 1838 4 1574 0 904 5 424 Name: Dependent_count, dtype: int64 ************************************************** int64 data type counts in column: Months_on_book Number of unique entries: 44 Highest entry counts (15): 36 2463 37 358 34 353 38 347 39 341 40 333 31 318 35 317 33 305 30 300 41 297 32 289 28 275 43 273 42 271 Name: Months_on_book, dtype: int64 ************************************************** int64 data type counts in column: Total_Relationship_Count Number of unique entries: 6 Highest entry counts (15): 3 2305 4 1912 5 1891 6 1866 2 1243 1 910 Name: Total_Relationship_Count, dtype: int64 ************************************************** int64 data type counts in column: Months_Inactive_12_mon Number of unique entries: 7 Highest entry counts (15): 3 3846 2 3282 1 2233 4 435 5 178 6 124 0 29 Name: Months_Inactive_12_mon, dtype: int64 ************************************************** int64 data type counts in column: Contacts_Count_12_mon Number of unique entries: 7 Highest entry counts (15): 3 3380 2 3227 1 1499 4 1392 0 399 5 176 6 54 Name: Contacts_Count_12_mon, dtype: int64 ************************************************** int64 data type counts in column: Total_Revolving_Bal Number of unique entries: 1974 Highest entry counts (15): 0 2470 2517 508 1965 12 1480 12 1434 11 1664 11 1720 11 1590 10 1542 10 1528 10 1175 10 1250 10 1560 10 1482 10 787 10 Name: Total_Revolving_Bal, dtype: int64 ************************************************** int64 data type counts in column: Total_Trans_Amt Number of unique entries: 5033 Highest entry counts 
(15): 4253 11 4509 11 4518 10 2229 10 4220 9 4869 9 4037 9 4313 9 4498 9 4042 9 1468 8 4348 8 4674 8 1731 8 4317 8 Name: Total_Trans_Amt, dtype: int64 ************************************************** int64 data type counts in column: Total_Trans_Ct Number of unique entries: 126 Highest entry counts (15): 81 208 71 203 75 203 69 202 82 202 76 198 77 197 70 193 74 190 78 190 67 186 79 184 73 183 80 173 68 170 Name: Total_Trans_Ct, dtype: int64 **************************************************
Observations and Insights - Integer Entry Counts
Overall, the attribute entry counts displayed reasonable values for this portion of the analysis
A breakdown of the attributes is as follows:
# create a float64 data type attribute list
flt_cols = df.select_dtypes(['float64'])
flt_cols.columns
Index(['Credit_Limit', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
dtype='object')
# Displaying unique value counts for float64 data types
for i in flt_cols:
    print("float64 data type counts in column: ", i)
    print("Number of unique entries: ", df[i].nunique())
    print("Highest entry counts (15):")
    print(df[i].value_counts(ascending=False, dropna=False).iloc[:15])
    print('*'*50, '\n')
float64 data type counts in column: Credit_Limit Number of unique entries: 6205 Highest entry counts (15): 34516.0 508 1438.3 507 9959.0 18 15987.0 18 23981.0 12 2490.0 11 6224.0 11 3735.0 11 7469.0 10 2069.0 8 2712.0 7 2001.0 7 2802.0 7 14938.0 7 2222.0 7 Name: Credit_Limit, dtype: int64 ************************************************** float64 data type counts in column: Avg_Open_To_Buy Number of unique entries: 6813 Highest entry counts (15): 1438.3 324 34516.0 98 31999.0 26 787.0 8 701.0 7 713.0 7 953.0 7 463.0 7 990.0 6 788.0 6 1623.0 6 756.0 6 649.0 6 988.0 6 683.0 6 Name: Avg_Open_To_Buy, dtype: int64 ************************************************** float64 data type counts in column: Total_Amt_Chng_Q4_Q1 Number of unique entries: 1158 Highest entry counts (15): 0.791 36 0.712 34 0.743 34 0.718 33 0.735 33 0.744 32 0.699 32 0.722 32 0.731 31 0.631 31 0.717 31 0.690 31 0.767 31 0.703 31 0.693 30 Name: Total_Amt_Chng_Q4_Q1, dtype: int64 ************************************************** float64 data type counts in column: Total_Ct_Chng_Q4_Q1 Number of unique entries: 830 Highest entry counts (15): 0.667 171 1.000 166 0.500 161 0.750 156 0.600 113 0.800 101 0.714 92 0.833 85 0.778 69 0.625 63 0.700 59 0.571 57 0.857 53 0.659 48 0.756 47 Name: Total_Ct_Chng_Q4_Q1, dtype: int64 ************************************************** float64 data type counts in column: Avg_Utilization_Ratio Number of unique entries: 964 Highest entry counts (15): 0.000 2470 0.073 44 0.057 33 0.048 32 0.060 30 0.061 29 0.045 29 0.059 28 0.069 28 0.053 27 0.077 26 0.039 26 0.070 26 0.067 25 0.071 24 Name: Avg_Utilization_Ratio, dtype: int64 **************************************************
Observations and Insights - Float Entry Counts
A breakdown of the attributes is as follows:
cat_cols = df.select_dtypes(['category'])
cat_cols.columns
Index(['Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category'],
dtype='object')
# Displaying unique value counts for Category data types
for i in cat_cols:
    print("category data type counts in column: ", i)
    print("Number of unique entries: ", df[i].nunique())
    print("Highest entry counts (10):")
    print(df[i].value_counts(ascending=False, dropna=False).iloc[:10])
    print('*'*50, '\n')
category data type counts in column: Attrition_Flag Number of unique entries: 2 Highest entry counts (10): Existing Customer 8500 Attrited Customer 1627 Name: Attrition_Flag, dtype: int64 ************************************************** category data type counts in column: Gender Number of unique entries: 2 Highest entry counts (10): F 5358 M 4769 Name: Gender, dtype: int64 ************************************************** category data type counts in column: Education_Level Number of unique entries: 6 Highest entry counts (10): Graduate 3128 High School 2013 NaN 1519 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 ************************************************** category data type counts in column: Marital_Status Number of unique entries: 3 Highest entry counts (10): Married 4687 Single 3943 NaN 749 Divorced 748 Name: Marital_Status, dtype: int64 ************************************************** category data type counts in column: Income_Category Number of unique entries: 6 Highest entry counts (10): Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64 ************************************************** category data type counts in column: Card_Category Number of unique entries: 4 Highest entry counts (10): Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64 **************************************************
Observations and Insights - Category Entry Counts
A breakdown of the attributes is as follows:
To conduct our EDA with this dataset, the data must first be prepared for analysis.
To prevent data leakage in later analysis, we will remain aware of any changes to the dataset that could contribute to it. At the same time, we will make the adjustments needed to use the data properly within this EDA portion of the analysis.
To maintain the integrity of the dataset, observations with missing (NaN) or invalid ("abc") entries will be relabeled as "Unknown" for this portion of the EDA.
# remove "abc" and replace with "Unknown"
df['Income_Category'] = df['Income_Category'].str.replace('abc','Unknown')
# re-applying data type conversion back into a category data type
df['Income_Category'] = df['Income_Category'].astype('category')
print(df['Income_Category'].value_counts(dropna=False, ascending = False))
Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 Unknown 1112 $120K + 727 Name: Income_Category, dtype: int64
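An alternative worth noting: for a column that is already categorical, `cat.rename_categories` relabels the invalid entry without the round trip through object dtype that `.str.replace` followed by `astype('category')` requires. A sketch on a toy series (values made up for illustration):

```python
import pandas as pd

# Toy categorical series containing the invalid "abc" label
s = pd.Series(["abc", "$120K +", "abc", "Less than $40K"], dtype="category")
# rename_categories relabels in place and keeps the category dtype
s = s.cat.rename_categories({"abc": "Unknown"})
print(s.dtype)    # category
print(s.tolist())
```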
As part of the effort to prevent data leakage in later analysis, missing values will not be treated until the dataset has been properly split. Since we also do not want to drop any observations, either during the EDA or in later analysis, the missing data will instead be handled by copying the data into a new EDA dataset in which the "NaN" values are relabeled as "Unknown"
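The split-first discipline described above can be sketched with toy data (column names borrowed from this dataset, values invented): the imputer's statistics are fit on the training rows only and merely applied to the held-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame standing in for the bank data
df = pd.DataFrame({
    "Marital_Status": ["Married", np.nan, "Single", "Married", np.nan, "Married"],
})
# Split BEFORE imputing so no test-set information leaks into the fit
X_train, X_test = train_test_split(df, test_size=0.33, random_state=1)
imp = SimpleImputer(strategy="most_frequent")
imp.fit(X_train)                       # statistics come from training rows only
X_test_filled = imp.transform(X_test)  # transform (not fit) on held-out rows
print(X_test_filled)
```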
# Copying the df dataset into a EDA dataset
df_eda = df.copy() # creating a new EDA dataset
Due to the significant number of missing observations, rather than delete them or impute values, we will change the missing NaN values to the string "Unknown" to give a more accurate portrayal of the values within the dataset
# Imputing Missing NaN with string value "Unknown"
df_atrb = df_eda['Education_Level'] # creating object name
df_atrb = df_atrb.cat.add_categories('Unknown') # setting category
df_atrb = df_atrb.fillna('Unknown') # replacing NaN with string "Unknown"
df_atrb = df_atrb.astype('category') # Apply data type conversion to category
df_eda['Education_Level'] = df_atrb.copy() # copying object back into data frame
print("Number of missing vales: ",df_atrb.isnull().sum())
print("New Value Counts:","\n",df_atrb.value_counts(ascending = False, dropna=False).iloc[:10])
Number of missing values: 0 New Value Counts: Graduate 3128 High School 2013 Unknown 1519 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64
# Imputing Missing NaN with string value "Unknown"
df_atrb = df_eda['Marital_Status'] # creating object name
df_atrb = df_atrb.cat.add_categories('Unknown') # setting category
df_atrb = df_atrb.fillna('Unknown') # replacing NaN with string "Unknown"
df_atrb = df_atrb.astype('category') # Apply data type conversion to category
df_eda['Marital_Status'] = df_atrb.copy() # copying object back into data frame
print("Number of missing vales: ",df_atrb.isnull().sum())
print("New Value Counts:","\n",df_atrb.value_counts(ascending = False, dropna=False).iloc[:10])
Number of missing values: 0 New Value Counts: Married 4687 Single 3943 Unknown 749 Divorced 748 Name: Marital_Status, dtype: int64
# Verifying ALL missing values have been removed
print (df_eda.isnull().sum())
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
All missing values have been imputed
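The two imputation cells above repeat the same steps per column; they could be collapsed into one loop over every categorical column that still has missing entries. A minimal sketch with toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_eda with two categorical columns containing NaN
df_eda = pd.DataFrame({
    "Education_Level": pd.Series(["Graduate", np.nan, "College"], dtype="category"),
    "Marital_Status": pd.Series([np.nan, "Married", "Single"], dtype="category"),
})
# One loop: add the "Unknown" category, then fill NaN with it
for col in df_eda.select_dtypes("category"):
    if df_eda[col].isnull().any():
        df_eda[col] = df_eda[col].cat.add_categories("Unknown").fillna("Unknown")
print(df_eda.isnull().sum().sum())  # -> 0
```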
# looking at 10 random rows - SL-W2-2.2
# Setting the random seed via np.random.seed to get the same random results every time
np.random.seed(1) # Set random seed
df_eda.sample(n=10) # Display random sample of 10 rows
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6498 | Existing Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Blue | 36 | 6 | 3 | 2 | 2570.0 | 2107 | 463.0 | 0.651 | 4058 | 83 | 0.766 | 0.820 |
| 9013 | Existing Customer | 38 | F | 1 | College | Unknown | Less than $40K | Blue | 32 | 2 | 3 | 3 | 2609.0 | 1259 | 1350.0 | 0.871 | 8677 | 96 | 0.627 | 0.483 |
| 2053 | Existing Customer | 39 | M | 2 | College | Married | $60K - $80K | Blue | 31 | 6 | 3 | 2 | 9871.0 | 1061 | 8810.0 | 0.545 | 1683 | 34 | 0.478 | 0.107 |
| 3211 | Existing Customer | 44 | M | 4 | Graduate | Married | $120K + | Blue | 32 | 6 | 3 | 4 | 34516.0 | 2517 | 31999.0 | 0.765 | 4228 | 83 | 0.596 | 0.073 |
| 5559 | Attrited Customer | 38 | F | 2 | Doctorate | Married | Less than $40K | Blue | 28 | 5 | 2 | 4 | 1614.0 | 0 | 1614.0 | 0.609 | 2437 | 46 | 0.438 | 0.000 |
| 6106 | Existing Customer | 54 | M | 3 | Post-Graduate | Single | $80K - $120K | Silver | 42 | 3 | 1 | 2 | 34516.0 | 2488 | 32028.0 | 0.552 | 4401 | 87 | 0.776 | 0.072 |
| 4150 | Attrited Customer | 53 | F | 3 | Graduate | Single | $40K - $60K | Blue | 40 | 6 | 3 | 2 | 1625.0 | 0 | 1625.0 | 0.689 | 2314 | 43 | 0.433 | 0.000 |
| 2205 | Existing Customer | 38 | M | 4 | Graduate | Married | $40K - $60K | Blue | 27 | 6 | 2 | 4 | 5535.0 | 1276 | 4259.0 | 0.636 | 1764 | 38 | 0.900 | 0.231 |
| 4145 | Existing Customer | 43 | M | 1 | Graduate | Single | $60K - $80K | Silver | 31 | 4 | 3 | 3 | 25824.0 | 1170 | 24654.0 | 0.684 | 3101 | 73 | 0.780 | 0.045 |
| 5324 | Attrited Customer | 50 | F | 1 | Doctorate | Single | Unknown | Blue | 46 | 6 | 4 | 3 | 1970.0 | 1477 | 493.0 | 0.662 | 2493 | 44 | 0.571 | 0.750 |
df_eda.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null category 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null category 3 Dependent_count 10127 non-null int64 4 Education_Level 10127 non-null category 5 Marital_Status 10127 non-null category 6 Income_Category 10127 non-null category 7 Card_Category 10127 non-null category 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Contacts_Count_12_mon 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Total_Revolving_Bal 10127 non-null int64 14 Avg_Open_To_Buy 10127 non-null float64 15 Total_Amt_Chng_Q4_Q1 10127 non-null float64 16 Total_Trans_Amt 10127 non-null int64 17 Total_Trans_Ct 10127 non-null int64 18 Total_Ct_Chng_Q4_Q1 10127 non-null float64 19 Avg_Utilization_Ratio 10127 non-null float64 dtypes: category(6), float64(5), int64(9) memory usage: 1.1 MB
Observations and Insights
Because some numerical attributes are actually categorical rather than integer-valued, these attributes will be separated out and plotted with barplots for a better representation of the data
# User defined function for a Histogram & Box Plot of numerical variables - SL-W2-MLS 10
def histogram_boxplot(feature, figsize=(10, 7), bins=None): # defined function
    """Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (10, 7))
    bins: number of bins (default None / auto)
    """
    sns.set(font_scale=2) # setting the font scale for seaborn
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2, # Number of rows of the subplot grid=2
        sharex=True, # x-axis shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)}, # setting height ratios
        figsize=figsize) # creating the 2 subplots
    sns.boxplot(feature, ax=ax_box2, showmeans=True,
        color="red") # star created will indicate the mean value
    if bins:
        sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins) # For histogram
    else:
        sns.distplot(feature, kde=True, ax=ax_hist2)
    ax_hist2.axvline(feature.mean(), color="g", linestyle="--") # mean line on the histogram
    ax_hist2.axvline(feature.median(), color="black", linestyle="-") # median line on the histogram
# Univariate analysis of category attribute using a defined function - source PF MLS 5
# Defined function that creates barplots that indicate percentage for each attribute -
def perc_on_bar(plot, feature):
    '''
    Annotates a barplot with the percentage for each category
    plot: the axes object returned by the plotting call
    feature: categorical feature
    Note: The function won't work if a column is passed in the hue parameter
    ''' # Description of the defined function
    total = len(feature) # length of the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total) # percentage of each class
        x = p.get_x() + p.get_width() / 3 - 0.05 # x position of the annotation
        y = p.get_y() + p.get_height() # y position of the annotation
        plot.annotate(percentage, (x, y), size=20) # annotates the percentage
    plt.show() # Display the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
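The stacked bars produced by the helper above rest on `pd.crosstab(..., normalize="index")`, which turns each row of the contingency table into within-category proportions. A self-contained sketch with toy data (values invented for illustration):

```python
import pandas as pd

# Toy data: two marital statuses crossed with attrition outcome
df = pd.DataFrame({
    "Marital_Status": ["Married", "Married", "Single", "Single"],
    "Attrition_Flag": ["Existing", "Attrited", "Existing", "Existing"],
})
# normalize="index" makes each row sum to 1, i.e. per-category proportions
tab = pd.crosstab(df["Marital_Status"], df["Attrition_Flag"], normalize="index")
print(tab)
```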
# Grouping integer data types into a list - SL-C-W2-MLS 2
num_column = df_eda.select_dtypes(['int64'])
# User defined function to count the number of unique values - SL-C-W2-MLS 2
def val_counts(df, col_list):
    cat_num = []
    cont_num = []
    for col in col_list: # Loop over column attributes in the dataset
        if df[col].nunique() < 10: # selecting attributes with fewer than 10 categories
            cat_num.append(col) # appending column label to cat_num
        else:
            cont_num.append(col) # appending column label to cont_num
    print("cat_num: ", cat_num, '\n')
    print("cont_num: ", cont_num)
    return cat_num, cont_num # return both lists
cat_num, cont_num = val_counts(df, num_column) # split numerical attributes by cardinality
cat_num: ['Dependent_count', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon'] cont_num: ['Customer_Age', 'Months_on_book', 'Total_Revolving_Bal', 'Total_Trans_Amt', 'Total_Trans_Ct']
# Grouping numerical attributes with greater than 10 unique entries or values
cont_num = ['Customer_Age', 'Months_on_book', 'Total_Revolving_Bal',
'Total_Trans_Amt', 'Total_Trans_Ct']
# boxplot and histogram for input numerical variable of all numerical variables- SL-W2-MLS10
# all_col = df.select_dtypes(include=np.number).columns.tolist() # columns with numeric values
# cont_num.remove("column") # Will not include 'column' for this portion of the EDA
for i in range(len(cont_num)): # loop to call histogram_boxplot for all columns
    histogram_boxplot(df_eda[cont_num[i]]) # Runs the histogram_boxplot defined function
    plt.show() # display plots
# Grouping numerical attributes with float value data types
flt_num = ['Credit_Limit', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
# boxplot and histogram for input numerical variable of all numerical variables-SL-W2-MLS10
# all_col = df.select_dtypes(include=np.number).columns.tolist() # columns with numeric values
# cont_num.remove("column") # Will not include 'column' for this portion of the EDA
for i in range(len(flt_num)): # loop to call histogram_boxplot for all columns
    histogram_boxplot(df_eda[flt_num[i]]) # Runs the histogram_boxplot defined function
    plt.show() # display plots
#### Observations and Insights - Float Value Attributes - Univariate Analysis
# Grouping numerical attributes with less than 10 category type values - SL-C-W2-MLS 2
cat_num = ['Attrition_Flag', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon']
# Generating a barplot that indicates the percentage for each category - SL-C-W2-MLS 2
for i in cat_num:
    plt.figure(figsize=(20,6))
    ax = sns.countplot(x=df_eda[i], palette='winter')
    perc_on_bar(ax, df_eda[i])
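`perc_on_bar` is likewise a helper defined earlier in the notebook. A plausible stand-in that annotates each countplot bar with its share of the total observations (again an assumption, not the original definition):

```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

def perc_on_bar(ax, series):
    """Sketch: label each countplot bar with its percentage of the series."""
    total = len(series)
    for patch in ax.patches:
        pct = 100.0 * patch.get_height() / total  # bar height as a percentage
        x = patch.get_x() + patch.get_width() / 2 # centre of the bar
        y = patch.get_height()                    # top of the bar
        ax.annotate(f'{pct:.1f}%', (x, y), ha='center', va='bottom')

# Example on a toy categorical series (hypothetical values)
s = pd.Series(['Blue', 'Blue', 'Silver', 'Gold', 'Blue'])
ax = sns.countplot(x=s)
perc_on_bar(ax, s)  # adds one percentage label per bar
```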
#### Observations and Insights - Numerical Category Attributes - Univariate Analysis
# Column range, min, max, quantile, and outlier values for all numerical variables
all_col = df_eda.select_dtypes(include=np.number).columns.tolist() # columns with numeric values
for i in range(len(all_col)):
    # Calculating quantile values Q1, Q2, Q3 (25%, 50%, 75%) and IQR - PF-W3-3.7
    Q1 = round(df_eda[all_col[i]].quantile(q=0.25)) # Q1, below which 25% of data lies - PF-W3-3.14
    Q2 = round(df_eda[all_col[i]].quantile(q=0.50)) # Q2, below which 50% of data lies - PF-W3-3.14
    Q3 = round(df_eda[all_col[i]].quantile(q=0.75)) # Q3, below which 75% of data lies - PF-W3-3.14
    IQR = round(df_eda[all_col[i]].quantile(0.75) - df_eda[all_col[i]].quantile(0.25), 2) # IQR value
    print(f'- {all_col[i]} Column -') # f-string display - SL-W2-2.2
    # Range, min, and max - PF-W3-1.7
    # source: https://www.datasciencemadesimple.com/get-maximum-value-column-python-pandas/
    # Range of the attribute (max minus min)
    print('\t Range of min./max. values: ',
          round(df_eda[all_col[i]].max() - df_eda[all_col[i]].min(), 2))
    print('\t\t\tMin. value: ', df_eda[all_col[i]].min()) # minimum value within the attribute
    print('\t\t\tMax. value: ', df_eda[all_col[i]].max()) # maximum value within the attribute
    print('\t\t\tMean value: ', round(df_eda[all_col[i]].mean(), 1)) # mean rounded to 1 decimal place
    print('\t\t Median value: ', df_eda[all_col[i]].median()) # median value of the attribute
    # Quantile values Q1, Q2, Q3 (25%, 50%, 75%) and IQR
    print('\t\tQuantile values')
    print('\t\t Q1 Quantile(25%): ', Q1) # prints Q1
    print('\t\t Q2 Quantile(50%): ', Q2) # prints Q2
    print('\t\t Q3 Quantile(75%): ', Q3) # prints Q3
    print('\tIQR Meas. of Dispersion: ', IQR) # prints the IQR
    # Identifying the outliers beyond the whiskers - PF MLS 5
    print('\tOutliers beyond whiskers')
    out_high = Q3 + (IQR * 1.5) # upper whisker: Q3 + 1.5*IQR
    out_low = Q1 - (IQR * 1.5)  # lower whisker: Q1 - 1.5*IQR
    print('\t Outliers greater than Q3: ', round(out_high, 1)) # prints out_high
    print('\t    Outliers less than Q1: ', round(out_low, 1)) # prints out_low
    print('\n') # blank line to separate columns

| Column | Min. | Max. | Range | Mean | Median | Q1 (25%) | Q2 (50%) | Q3 (75%) | IQR | Upper whisker (Q3 + 1.5×IQR) | Lower whisker (Q1 − 1.5×IQR) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Customer_Age | 26 | 73 | 47 | 46.3 | 46.0 | 41 | 46 | 52 | 11.0 | 68.5 | 24.5 |
| Dependent_count | 0 | 5 | 5 | 2.3 | 2.0 | 1 | 2 | 3 | 2.0 | 6.0 | -2.0 |
| Months_on_book | 13 | 56 | 43 | 35.9 | 36.0 | 31 | 36 | 40 | 9.0 | 53.5 | 17.5 |
| Total_Relationship_Count | 1 | 6 | 5 | 3.8 | 4.0 | 3 | 4 | 5 | 2.0 | 8.0 | 0.0 |
| Months_Inactive_12_mon | 0 | 6 | 6 | 2.3 | 2.0 | 2 | 2 | 3 | 1.0 | 4.5 | 0.5 |
| Contacts_Count_12_mon | 0 | 6 | 6 | 2.5 | 2.0 | 2 | 2 | 3 | 1.0 | 4.5 | 0.5 |
| Credit_Limit | 1438.3 | 34516.0 | 33077.7 | 8632.0 | 4549.0 | 2555 | 4549 | 11068 | 8512.5 | 23836.8 | -10213.8 |
| Total_Revolving_Bal | 0 | 2517 | 2517 | 1162.8 | 1276.0 | 359 | 1276 | 1784 | 1425.0 | 3921.5 | -1778.5 |
| Avg_Open_To_Buy | 3.0 | 34516.0 | 34513.0 | 7469.1 | 3474.0 | 1324 | 3474 | 9859 | 8534.5 | 22660.8 | -11477.8 |
| Total_Amt_Chng_Q4_Q1 | 0.0 | 3.397 | 3.4 | 0.8 | 0.736 | 1 | 1 | 1 | 0.23 | 1.3 | 0.7 |
| Total_Trans_Amt | 510 | 18484 | 17974 | 4404.1 | 3899.0 | 2156 | 3899 | 4741 | 2585.5 | 8619.2 | -1722.2 |
| Total_Trans_Ct | 10 | 139 | 129 | 64.9 | 67.0 | 45 | 67 | 81 | 36.0 | 135.0 | -9.0 |
| Total_Ct_Chng_Q4_Q1 | 0.0 | 3.714 | 3.71 | 0.7 | 0.702 | 1 | 1 | 1 | 0.24 | 1.4 | 0.6 |
| Avg_Utilization_Ratio | 0.0 | 0.999 | 1.0 | 0.3 | 0.176 | 0 | 0 | 1 | 0.48 | 1.7 | -0.7 |
Based on the results above, the non-categorical attributes, regardless of data type, should be examined further using both the numerical summaries and the boxplots.
Next, we group all non-categorical attributes into a list and apply the boxplot function below, which clearly displays the outliers in each numerical column.
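The whisker bounds used in these checks follow the standard Tukey rule, Q1 − 1.5×IQR and Q3 + 1.5×IQR. A standalone sketch on toy values (hypothetical, not the Thera bank data) illustrates the calculation:

```python
import pandas as pd

s = pd.Series([26, 41, 46, 46, 52, 52, 73, 200])  # 200 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1                 # interquartile range
upper = q3 + 1.5 * iqr        # values above this are flagged as high outliers
lower = q1 - 1.5 * iqr        # values below this are flagged as low outliers
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # → [200]
```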
# Grouping non-categorical attributes for outliers
out_att = ['Customer_Age', 'Months_on_book', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy',
'Total_Trans_Amt','Total_Trans_Ct','Total_Amt_Chng_Q4_Q1','Total_Ct_Chng_Q4_Q1','Avg_Utilization_Ratio']
# Plot the boxplots of all non-categorical attributes to check for outliers - SL-W2 - 2.14
plt.figure(figsize=(20, 30)) # figure size
for i, variable in enumerate(out_att): # for loop to apply to all numeric columns
    plt.subplot(5, 4, i + 1) # subplot settings
    plt.boxplot(df_eda[variable], whis=1.5) # boxplot function
    plt.tight_layout() # layout adjustment
    plt.title(variable) # adding title to plots
plt.show() # display plot
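Instead of inspecting each attribute's filter by hand in the cells below, the whisker-crossing rows can also be counted per column in one pass. A self-contained sketch on a toy frame (the real call would pass `df_eda`):

```python
import numpy as np
import pandas as pd

def count_outliers(df):
    """Count values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR for each numeric column."""
    counts = {}
    for col in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        counts[col] = int(((df[col] < low) | (df[col] > high)).sum())
    return counts

# Toy frame standing in for df_eda: 'a' has one clear outlier, 'b' has none
toy = pd.DataFrame({'a': [1, 2, 2, 3, 2, 50], 'b': [10.0, 11.0, 9.5, 10.5, 10.2, 10.8]})
print(count_outliers(toy))  # → {'a': 1, 'b': 0}
```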
#### Checking the number of customers with high Credit Limits
df_eda[df_eda['Credit_Limit']>23836.8]
|  | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | Existing Customer | 51 | M | 4 | Unknown | Married | $120K + | Gold | 46 | 6 | 1 | 3 | 34516.0 | 2264 | 32252.0 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 7 | Existing Customer | 32 | M | 0 | High School | Unknown | $60K - $80K | Silver | 27 | 2 | 2 | 2 | 29081.0 | 1396 | 27685.0 | 2.204 | 1538 | 36 | 0.714 | 0.048 |
| 16 | Existing Customer | 48 | M | 4 | Post-Graduate | Single | $80K - $120K | Blue | 36 | 6 | 2 | 3 | 30367.0 | 2362 | 28005.0 | 1.708 | 1671 | 27 | 0.929 | 0.078 |
| 40 | Existing Customer | 45 | M | 3 | Graduate | Single | $80K - $120K | Blue | 41 | 2 | 2 | 2 | 32426.0 | 578 | 31848.0 | 1.042 | 1109 | 28 | 0.474 | 0.018 |
| 45 | Existing Customer | 49 | M | 4 | Uneducated | Single | $80K - $120K | Blue | 30 | 3 | 2 | 3 | 34516.0 | 0 | 34516.0 | 1.621 | 1444 | 28 | 1.333 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10098 | Attrited Customer | 55 | M | 3 | Graduate | Single | $120K + | Silver | 36 | 4 | 3 | 4 | 34516.0 | 0 | 34516.0 | 1.007 | 9931 | 70 | 0.750 | 0.000 |
| 10100 | Existing Customer | 39 | M | 2 | Graduate | Unknown | $60K - $80K | Silver | 36 | 4 | 2 | 2 | 29808.0 | 0 | 29808.0 | 0.669 | 16098 | 128 | 0.684 | 0.000 |
| 10104 | Existing Customer | 51 | M | 3 | Graduate | Single | $60K - $80K | Silver | 36 | 3 | 2 | 2 | 29663.0 | 1743 | 27920.0 | 0.667 | 14638 | 93 | 0.722 | 0.059 |
| 10110 | Existing Customer | 56 | M | 1 | Graduate | Single | $80K - $120K | Silver | 49 | 5 | 2 | 2 | 34516.0 | 1091 | 33425.0 | 0.640 | 15274 | 108 | 0.714 | 0.032 |
| 10112 | Attrited Customer | 33 | M | 2 | College | Married | $120K + | Gold | 20 | 2 | 1 | 4 | 34516.0 | 0 | 34516.0 | 1.004 | 9338 | 73 | 0.622 | 0.000 |
984 rows × 20 columns
#### Checking the number of customers with high Average Open To Buy values
df_eda[df_eda['Avg_Open_To_Buy']>22660.8]
|  | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | Existing Customer | 51 | M | 4 | Unknown | Married | $120K + | Gold | 46 | 6 | 1 | 3 | 34516.0 | 2264 | 32252.0 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 7 | Existing Customer | 32 | M | 0 | High School | Unknown | $60K - $80K | Silver | 27 | 2 | 2 | 2 | 29081.0 | 1396 | 27685.0 | 2.204 | 1538 | 36 | 0.714 | 0.048 |
| 16 | Existing Customer | 48 | M | 4 | Post-Graduate | Single | $80K - $120K | Blue | 36 | 6 | 2 | 3 | 30367.0 | 2362 | 28005.0 | 1.708 | 1671 | 27 | 0.929 | 0.078 |
| 40 | Existing Customer | 45 | M | 3 | Graduate | Single | $80K - $120K | Blue | 41 | 2 | 2 | 2 | 32426.0 | 578 | 31848.0 | 1.042 | 1109 | 28 | 0.474 | 0.018 |
| 45 | Existing Customer | 49 | M | 4 | Uneducated | Single | $80K - $120K | Blue | 30 | 3 | 2 | 3 | 34516.0 | 0 | 34516.0 | 1.621 | 1444 | 28 | 1.333 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10100 | Existing Customer | 39 | M | 2 | Graduate | Unknown | $60K - $80K | Silver | 36 | 4 | 2 | 2 | 29808.0 | 0 | 29808.0 | 0.669 | 16098 | 128 | 0.684 | 0.000 |
| 10103 | Attrited Customer | 51 | M | 1 | High School | Married | $80K - $120K | Blue | 36 | 4 | 3 | 4 | 22754.0 | 0 | 22754.0 | 0.799 | 8531 | 77 | 0.791 | 0.000 |
| 10104 | Existing Customer | 51 | M | 3 | Graduate | Single | $60K - $80K | Silver | 36 | 3 | 2 | 2 | 29663.0 | 1743 | 27920.0 | 0.667 | 14638 | 93 | 0.722 | 0.059 |
| 10110 | Existing Customer | 56 | M | 1 | Graduate | Single | $80K - $120K | Silver | 49 | 5 | 2 | 2 | 34516.0 | 1091 | 33425.0 | 0.640 | 15274 | 108 | 0.714 | 0.032 |
| 10112 | Attrited Customer | 33 | M | 2 | College | Married | $120K + | Gold | 20 | 2 | 1 | 4 | 34516.0 | 0 | 34516.0 | 1.004 | 9338 | 73 | 0.622 | 0.000 |
963 rows × 20 columns
#### Checking the number of customers with high Total Amount Change Q4_Q1 Ratios
df_eda[df_eda['Total_Amt_Chng_Q4_Q1']>1.3]
|  | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4140 | Existing Customer | 45 | F | 2 | High School | Single | Unknown | Blue | 37 | 4 | 2 | 4 | 27804.0 | 0 | 27804.0 | 1.301 | 2761 | 57 | 0.966 | 0.000 |
| 5679 | Existing Customer | 44 | F | 3 | Unknown | Divorced | Less than $40K | Blue | 24 | 4 | 1 | 2 | 2617.0 | 1508 | 1109.0 | 1.393 | 4400 | 60 | 0.818 | 0.576 |
| 6303 | Existing Customer | 43 | M | 2 | High School | Married | $60K - $80K | Blue | 25 | 3 | 2 | 3 | 2049.0 | 1530 | 519.0 | 1.344 | 4521 | 70 | 0.628 | 0.747 |
| 7438 | Existing Customer | 42 | F | 3 | Graduate | Married | Less than $40K | Blue | 37 | 4 | 2 | 1 | 2730.0 | 1303 | 1427.0 | 1.308 | 4750 | 76 | 0.900 | 0.477 |
| 9695 | Attrited Customer | 41 | F | 4 | Unknown | Single | Unknown | Blue | 25 | 2 | 2 | 3 | 29295.0 | 0 | 29295.0 | 1.411 | 9865 | 65 | 0.711 | 0.000 |
238 rows × 20 columns
#### Checking the number of customers with high Total Transaction Amounts
df_eda[df_eda['Total_Trans_Amt']>8619.2]
|  | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8591 | Existing Customer | 50 | M | 3 | High School | Single | $80K - $120K | Blue | 39 | 3 | 3 | 2 | 21322.0 | 2216 | 19106.0 | 0.921 | 8693 | 91 | 0.569 | 0.104 |
| 8650 | Existing Customer | 43 | F | 2 | College | Married | Less than $40K | Blue | 37 | 3 | 3 | 1 | 5380.0 | 1337 | 4043.0 | 0.845 | 8947 | 93 | 0.576 | 0.249 |
| 8670 | Existing Customer | 39 | F | 4 | Doctorate | Married | Less than $40K | Blue | 27 | 2 | 1 | 1 | 8587.0 | 1324 | 7263.0 | 0.897 | 8854 | 92 | 0.643 | 0.154 |
| 8708 | Existing Customer | 51 | F | 2 | Unknown | Married | $40K - $60K | Blue | 36 | 2 | 2 | 1 | 14800.0 | 1020 | 13780.0 | 0.896 | 8796 | 94 | 0.593 | 0.069 |
| 8734 | Existing Customer | 45 | M | 5 | Unknown | Married | $60K - $80K | Blue | 38 | 1 | 2 | 3 | 6895.0 | 2314 | 4581.0 | 0.840 | 8778 | 85 | 0.604 | 0.336 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10121 | Existing Customer | 56 | F | 1 | Graduate | Single | Less than $40K | Blue | 50 | 4 | 1 | 4 | 3688.0 | 606 | 3082.0 | 0.570 | 14596 | 120 | 0.791 | 0.164 |
| 10122 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | Attrited Customer | 41 | M | 2 | Unknown | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10126 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
896 rows × 20 columns
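If these outliers are later judged to distort the models, one common treatment is capping (winsorising) at the whiskers rather than dropping rows. A standalone sketch on toy values (hypothetical numbers chosen to echo the Total_Trans_Amt quartiles above):

```python
import pandas as pd

s = pd.Series([510.0, 2156.0, 3899.0, 4741.0, 18484.0])  # toy transaction amounts
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Clip everything to the [Q1 - 1.5*IQR, Q3 + 1.5*IQR] whisker range
capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(capped.max())  # → 8618.5 (the extreme 18484.0 was pulled down to the upper whisker)
```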
#### Checking the number of customers with high Total Transaction Count Change Q4_Q1 Ratios
df_eda[df_eda['Total_Ct_Chng_Q4_Q1']>1.4]
|  | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| 12 | Existing Customer | 56 | M | 1 | College | Single | $80K - $120K | Blue | 36 | 3 | 6 | 0 | 11751.0 | 0 | 11751.0 | 3.397 | 1539 | 17 | 3.250 | 0.000 |
| 13 | Existing Customer | 35 | M | 3 | Graduate | Unknown | $60K - $80K | Blue | 30 | 5 | 1 | 3 | 8547.0 | 1666 | 6881.0 | 1.163 | 1311 | 33 | 2.000 | 0.195 |
| 15 | Existing Customer | 44 | M | 4 | Unknown | Unknown | $80K - $120K | Blue | 37 | 5 | 1 | 2 | 4234.0 | 972 | 3262.0 | 1.707 | 1348 | 27 | 1.700 | 0.230 |
| 17 | Existing Customer | 41 | M | 3 | Unknown | Married | $80K - $120K | Blue | 34 | 4 | 4 | 1 | 13535.0 | 1291 | 12244.0 | 0.653 | 1028 | 21 | 1.625 | 0.095 |
| 22 | Existing Customer | 41 | M | 3 | High School | Married | $40K - $60K | Blue | 33 | 4 | 2 | 1 | 4470.0 | 680 | 3790.0 | 1.608 | 931 | 18 | 1.571 | 0.152 |
| 28 | Existing Customer | 44 | F | 3 | Uneducated | Single | Unknown | Blue | 34 | 5 | 2 | 2 | 10100.0 | 0 | 10100.0 | 0.525 | 1052 | 18 | 1.571 | 0.000 |
| 30 | Existing Customer | 53 | M | 3 | Unknown | Married | $80K - $120K | Blue | 33 | 3 | 2 | 3 | 2753.0 | 1811 | 942.0 | 0.977 | 1038 | 25 | 2.571 | 0.658 |
| 31 | Existing Customer | 53 | M | 2 | Uneducated | Married | $60K - $80K | Blue | 48 | 2 | 5 | 1 | 2451.0 | 1690 | 761.0 | 1.323 | 1596 | 26 | 1.600 | 0.690 |
| 32 | Existing Customer | 41 | M | 4 | Graduate | Married | $60K - $80K | Blue | 36 | 4 | 1 | 2 | 8923.0 | 2517 | 6406.0 | 1.726 | 1589 | 24 | 1.667 | 0.282 |
| 47 | Existing Customer | 59 | M | 1 | Doctorate | Married | $40K - $60K | Blue | 52 | 3 | 2 | 2 | 2548.0 | 2020 | 528.0 | 2.357 | 1719 | 27 | 1.700 | 0.793 |
| 52 | Existing Customer | 66 | F | 0 | High School | Married | Less than $40K | Blue | 54 | 3 | 4 | 2 | 3171.0 | 2179 | 992.0 | 1.224 | 1946 | 38 | 1.923 | 0.687 |
| 57 | Existing Customer | 56 | M | 2 | Uneducated | Married | $40K - $60K | Blue | 50 | 4 | 2 | 3 | 4458.0 | 1880 | 2578.0 | 1.107 | 1424 | 29 | 1.417 | 0.422 |
| 68 | Existing Customer | 49 | M | 2 | Graduate | Married | $60K - $80K | Blue | 32 | 2 | 2 | 2 | 1687.0 | 1107 | 580.0 | 1.715 | 1670 | 17 | 2.400 | 0.656 |
| 69 | Existing Customer | 50 | M | 2 | Doctorate | Married | $80K - $120K | Blue | 38 | 6 | 2 | 2 | 25300.0 | 1330 | 23970.0 | 1.072 | 837 | 15 | 2.000 | 0.053 |
| 76 | Existing Customer | 44 | F | 4 | Graduate | Single | Less than $40K | Blue | 36 | 6 | 4 | 2 | 7000.0 | 2517 | 4483.0 | 0.475 | 1112 | 23 | 1.875 | 0.360 |
| 80 | Existing Customer | 47 | M | 3 | Graduate | Married | $60K - $80K | Blue | 37 | 4 | 4 | 0 | 8567.0 | 1695 | 6872.0 | 1.242 | 1457 | 41 | 1.412 | 0.198 |
| 84 | Existing Customer | 53 | M | 1 | Graduate | Divorced | $80K - $120K | Blue | 35 | 5 | 4 | 2 | 34516.0 | 1219 | 33297.0 | 1.129 | 1590 | 27 | 2.000 | 0.035 |
| 88 | Existing Customer | 44 | M | 3 | High School | Single | $60K - $80K | Blue | 31 | 4 | 3 | 1 | 12756.0 | 837 | 11919.0 | 1.932 | 1413 | 14 | 1.800 | 0.066 |
| 89 | Existing Customer | 57 | M | 2 | Unknown | Married | $120K + | Blue | 45 | 5 | 3 | 3 | 5266.0 | 0 | 5266.0 | 1.702 | 1516 | 29 | 1.636 | 0.000 |
| 91 | Existing Customer | 49 | M | 4 | High School | Single | $80K - $120K | Blue | 38 | 4 | 3 | 0 | 31302.0 | 1953 | 29349.0 | 0.875 | 1564 | 35 | 2.182 | 0.062 |
| 103 | Existing Customer | 44 | F | 2 | Graduate | Single | Less than $40K | Blue | 32 | 3 | 3 | 3 | 3031.0 | 1879 | 1152.0 | 0.788 | 1298 | 15 | 1.500 | 0.620 |
| 111 | Existing Customer | 42 | M | 3 | Unknown | Married | $60K - $80K | Blue | 36 | 5 | 3 | 3 | 15088.0 | 865 | 14223.0 | 0.939 | 1107 | 21 | 1.625 | 0.057 |
| 113 | Existing Customer | 54 | F | 0 | Uneducated | Married | Less than $40K | Blue | 36 | 2 | 2 | 2 | 1494.0 | 706 | 788.0 | 1.674 | 1305 | 24 | 3.000 | 0.473 |
| 115 | Existing Customer | 49 | M | 1 | Graduate | Single | $80K - $120K | Blue | 36 | 6 | 2 | 2 | 18886.0 | 895 | 17991.0 | 1.826 | 1235 | 18 | 1.571 | 0.047 |
| 122 | Existing Customer | 53 | M | 1 | High School | Married | $80K - $120K | Blue | 47 | 3 | 2 | 2 | 12262.0 | 1836 | 10426.0 | 1.584 | 1659 | 38 | 1.533 | 0.150 |
| 128 | Existing Customer | 52 | M | 2 | High School | Single | $40K - $60K | Blue | 45 | 3 | 1 | 2 | 13532.0 | 1300 | 12232.0 | 0.846 | 1521 | 29 | 1.417 | 0.096 |
| 131 | Existing Customer | 43 | M | 4 | Unknown | Married | $40K - $60K | Blue | 39 | 5 | 1 | 2 | 6111.0 | 2517 | 3594.0 | 0.632 | 1221 | 16 | 2.200 | 0.412 |
| 132 | Existing Customer | 57 | M | 4 | Graduate | Married | $80K - $120K | Blue | 46 | 3 | 2 | 3 | 19270.0 | 1662 | 17608.0 | 1.186 | 1565 | 28 | 1.545 | 0.086 |
| 142 | Existing Customer | 54 | M | 4 | Graduate | Married | $80K - $120K | Blue | 34 | 2 | 3 | 2 | 14926.0 | 2517 | 12409.0 | 1.996 | 1576 | 25 | 1.500 | 0.169 |
| 146 | Existing Customer | 41 | F | 2 | Graduate | Single | Less than $40K | Blue | 32 | 6 | 3 | 2 | 2250.0 | 2117 | 133.0 | 1.162 | 1617 | 31 | 2.875 | 0.941 |
| 151 | Existing Customer | 68 | M | 1 | Graduate | Married | Unknown | Blue | 56 | 5 | 2 | 3 | 13860.0 | 1652 | 12208.0 | 1.255 | 1910 | 32 | 1.909 | 0.119 |
| 158 | Existing Customer | 44 | F | 2 | Uneducated | Married | Unknown | Silver | 35 | 4 | 3 | 2 | 32643.0 | 0 | 32643.0 | 1.300 | 1058 | 24 | 2.429 | 0.000 |
| 162 | Existing Customer | 46 | M | 1 | Uneducated | Married | $60K - $80K | Blue | 36 | 3 | 2 | 2 | 20459.0 | 2071 | 18388.0 | 1.248 | 1767 | 38 | 2.167 | 0.101 |
| 167 | Existing Customer | 58 | F | 4 | Uneducated | Married | Less than $40K | Blue | 47 | 3 | 3 | 3 | 2822.0 | 2173 | 649.0 | 1.456 | 1218 | 23 | 2.286 | 0.770 |
| 170 | Existing Customer | 45 | F | 4 | Uneducated | Divorced | Unknown | Blue | 36 | 4 | 3 | 3 | 15875.0 | 1719 | 14156.0 | 0.860 | 1592 | 37 | 1.467 | 0.108 |
| 178 | Existing Customer | 57 | F | 2 | Uneducated | Single | $40K - $60K | Silver | 36 | 6 | 3 | 0 | 19482.0 | 1072 | 18410.0 | 0.706 | 1421 | 22 | 1.444 | 0.055 |
| 190 | Existing Customer | 57 | M | 1 | Graduate | Married | $80K - $120K | Blue | 47 | 5 | 3 | 1 | 14612.0 | 1976 | 12636.0 | 1.768 | 1827 | 24 | 3.000 | 0.135 |
| 199 | Existing Customer | 59 | M | 1 | College | Married | $60K - $80K | Blue | 36 | 4 | 1 | 2 | 7871.0 | 1687 | 6184.0 | 1.242 | 1428 | 22 | 1.444 | 0.214 |
| 231 | Existing Customer | 57 | M | 2 | Unknown | Married | $80K - $120K | Blue | 46 | 2 | 3 | 0 | 18871.0 | 1740 | 17131.0 | 1.727 | 1516 | 21 | 2.000 | 0.092 |
| 235 | Existing Customer | 55 | M | 2 | College | Married | $60K - $80K | Blue | 48 | 4 | 3 | 2 | 8863.0 | 1301 | 7562.0 | 1.207 | 1512 | 27 | 1.455 | 0.147 |
| 239 | Existing Customer | 44 | M | 4 | Unknown | Single | $80K - $120K | Blue | 34 | 4 | 1 | 2 | 21573.0 | 1585 | 19988.0 | 0.621 | 1384 | 36 | 2.273 | 0.073 |
| 251 | Existing Customer | 73 | M | 0 | High School | Married | $40K - $60K | Blue | 36 | 5 | 3 | 2 | 4469.0 | 1125 | 3344.0 | 1.363 | 1765 | 34 | 1.615 | 0.252 |
| 255 | Existing Customer | 55 | F | 2 | High School | Married | Less than $40K | Blue | 36 | 4 | 3 | 3 | 2358.0 | 1152 | 1206.0 | 0.969 | 1737 | 26 | 1.600 | 0.489 |
| 266 | Existing Customer | 51 | F | 3 | Unknown | Unknown | Less than $40K | Blue | 25 | 2 | 2 | 2 | 9512.0 | 1082 | 8430.0 | 0.701 | 1007 | 22 | 1.444 | 0.114 |
| 269 | Existing Customer | 54 | M | 5 | Graduate | Married | $60K - $80K | Blue | 38 | 3 | 3 | 3 | 2290.0 | 1434 | 856.0 | 0.923 | 1119 | 18 | 3.500 | 0.626 |
| 280 | Existing Customer | 43 | M | 1 | Graduate | Single | $80K - $120K | Silver | 37 | 4 | 3 | 2 | 34516.0 | 1440 | 33076.0 | 1.117 | 1575 | 34 | 2.400 | 0.042 |
| 281 | Existing Customer | 41 | M | 4 | High School | Unknown | $80K - $120K | Blue | 36 | 4 | 3 | 0 | 23018.0 | 2168 | 20850.0 | 0.859 | 1463 | 24 | 1.667 | 0.094 |
| 282 | Existing Customer | 54 | F | 1 | Uneducated | Married | $40K - $60K | Blue | 44 | 5 | 2 | 3 | 5556.0 | 1711 | 3845.0 | 1.434 | 1706 | 21 | 1.625 | 0.308 |
| 284 | Existing Customer | 61 | M | 0 | Graduate | Married | $40K - $60K | Blue | 52 | 3 | 1 | 2 | 2939.0 | 1999 | 940.0 | 2.145 | 2434 | 33 | 1.538 | 0.680 |
| 294 | Existing Customer | 45 | M | 2 | College | Single | $60K - $80K | Blue | 33 | 6 | 3 | 0 | 23218.0 | 1814 | 21404.0 | 1.178 | 1749 | 37 | 2.083 | 0.078 |
| 295 | Existing Customer | 60 | M | 0 | High School | Married | $40K - $60K | Blue | 36 | 5 | 1 | 3 | 3281.0 | 837 | 2444.0 | 1.859 | 1424 | 29 | 1.417 | 0.255 |
| 300 | Existing Customer | 50 | M | 0 | High School | Single | $120K + | Silver | 45 | 5 | 2 | 3 | 14938.0 | 2303 | 12635.0 | 0.804 | 949 | 27 | 2.000 | 0.154 |
| 309 | Existing Customer | 43 | F | 4 | Uneducated | Married | Less than $40K | Blue | 36 | 2 | 1 | 1 | 9684.0 | 2475 | 7209.0 | 0.697 | 1400 | 31 | 2.100 | 0.256 |
| 315 | Existing Customer | 38 | F | 2 | Graduate | Single | Unknown | Blue | 31 | 2 | 3 | 2 | 14041.0 | 2154 | 11887.0 | 1.084 | 1467 | 29 | 1.417 | 0.153 |
| 322 | Existing Customer | 51 | F | 2 | Uneducated | Divorced | Less than $40K | Blue | 38 | 4 | 1 | 2 | 9648.0 | 1926 | 7722.0 | 0.549 | 1083 | 22 | 1.750 | 0.200 |
| 323 | Existing Customer | 52 | M | 2 | High School | Single | $80K - $120K | Silver | 40 | 3 | 3 | 3 | 34516.0 | 2177 | 32339.0 | 1.320 | 1234 | 23 | 1.875 | 0.063 |
| 324 | Existing Customer | 55 | F | 2 | Uneducated | Married | Unknown | Blue | 35 | 3 | 3 | 0 | 3025.0 | 2376 | 649.0 | 1.252 | 1277 | 23 | 1.875 | 0.785 |
| 327 | Existing Customer | 44 | F | 2 | Post-Graduate | Married | Less than $40K | Blue | 36 | 3 | 3 | 2 | 2293.0 | 1823 | 470.0 | 0.617 | 1166 | 22 | 1.444 | 0.795 |
| 346 | Existing Customer | 58 | F | 3 | High School | Married | Less than $40K | Blue | 47 | 6 | 1 | 0 | 2609.0 | 2439 | 170.0 | 1.269 | 1731 | 34 | 1.833 | 0.935 |
| 350 | Existing Customer | 54 | M | 2 | Graduate | Married | Less than $40K | Blue | 39 | 4 | 4 | 2 | 2702.0 | 2517 | 185.0 | 1.019 | 1658 | 32 | 1.462 | 0.932 |
| 355 | Existing Customer | 46 | M | 4 | Graduate | Single | $60K - $80K | Blue | 39 | 6 | 2 | 3 | 10741.0 | 755 | 9986.0 | 1.166 | 1278 | 22 | 1.444 | 0.070 |
| 366 | Existing Customer | 36 | F | 4 | Graduate | Married | $40K - $60K | Blue | 36 | 6 | 3 | 3 | 1628.0 | 969 | 659.0 | 0.999 | 1893 | 15 | 2.750 | 0.595 |
| 392 | Existing Customer | 56 | M | 3 | High School | Married | $60K - $80K | Blue | 36 | 5 | 1 | 3 | 9226.0 | 1233 | 7993.0 | 1.521 | 1626 | 29 | 1.636 | 0.134 |
| 418 | Existing Customer | 44 | M | 5 | Uneducated | Married | $60K - $80K | Blue | 36 | 3 | 3 | 2 | 8645.0 | 1454 | 7191.0 | 0.708 | 1435 | 25 | 1.778 | 0.168 |
| 432 | Existing Customer | 67 | F | 0 | Unknown | Married | Unknown | Blue | 56 | 4 | 3 | 3 | 10509.0 | 2388 | 8121.0 | 0.777 | 1365 | 34 | 1.429 | 0.227 |
| 438 | Existing Customer | 40 | M | 4 | Doctorate | Divorced | $60K - $80K | Blue | 34 | 4 | 2 | 3 | 23138.0 | 1727 | 21411.0 | 0.693 | 1612 | 39 | 1.438 | 0.075 |
| 456 | Existing Customer | 47 | M | 3 | Uneducated | Married | $80K - $120K | Blue | 36 | 6 | 2 | 2 | 29770.0 | 1565 | 28205.0 | 0.674 | 1774 | 42 | 2.000 | 0.053 |
| 457 | Existing Customer | 46 | F | 2 | Post-Graduate | Married | Less than $40K | Blue | 28 | 5 | 1 | 1 | 5014.0 | 1910 | 3104.0 | 0.863 | 1209 | 25 | 1.500 | 0.381 |
| 466 | Existing Customer | 63 | M | 2 | Graduate | Married | $60K - $80K | Blue | 49 | 5 | 2 | 3 | 14035.0 | 2061 | 11974.0 | 2.271 | 1606 | 30 | 1.500 | 0.147 |
| 479 | Existing Customer | 58 | M | 4 | Unknown | Married | $60K - $80K | Blue | 44 | 5 | 1 | 3 | 12010.0 | 2149 | 9861.0 | 0.801 | 1700 | 35 | 1.500 | 0.179 |
| 533 | Existing Customer | 59 | F | 0 | Unknown | Married | Less than $40K | Blue | 53 | 4 | 3 | 2 | 5512.0 | 1124 | 4388.0 | 1.391 | 1975 | 49 | 1.579 | 0.204 |
| 547 | Existing Customer | 61 | F | 1 | Unknown | Married | Unknown | Blue | 47 | 4 | 2 | 3 | 1438.3 | 0 | 1438.3 | 1.337 | 1449 | 27 | 1.455 | 0.000 |
| 621 | Existing Customer | 52 | M | 2 | High School | Married | $120K + | Blue | 36 | 5 | 1 | 2 | 2981.0 | 1358 | 1623.0 | 1.242 | 1648 | 30 | 1.500 | 0.456 |
| 685 | Existing Customer | 33 | M | 2 | Doctorate | Married | $60K - $80K | Blue | 28 | 3 | 3 | 3 | 4049.0 | 0 | 4049.0 | 0.898 | 1742 | 35 | 1.500 | 0.000 |
| 697 | Existing Customer | 55 | M | 2 | High School | Married | $80K - $120K | Blue | 36 | 3 | 2 | 3 | 6514.0 | 2258 | 4256.0 | 0.785 | 1383 | 20 | 1.857 | 0.347 |
| 742 | Existing Customer | 43 | M | 2 | High School | Single | $60K - $80K | Blue | 36 | 4 | 3 | 2 | 6976.0 | 0 | 6976.0 | 0.519 | 1217 | 39 | 1.438 | 0.000 |
| 757 | Attrited Customer | 43 | M | 2 | Uneducated | Married | $60K - $80K | Blue | 24 | 2 | 3 | 3 | 14447.0 | 0 | 14447.0 | 0.856 | 915 | 29 | 2.222 | 0.000 |
| 760 | Existing Customer | 62 | F | 0 | Unknown | Married | $40K - $60K | Blue | 36 | 4 | 2 | 2 | 2964.0 | 2400 | 564.0 | 1.296 | 1557 | 25 | 1.778 | 0.810 |
| 761 | Existing Customer | 58 | M | 0 | Uneducated | Married | $80K - $120K | Blue | 38 | 6 | 3 | 3 | 3785.0 | 1935 | 1850.0 | 1.558 | 1824 | 49 | 1.450 | 0.511 |
| 773 | Existing Customer | 61 | M | 0 | Post-Graduate | Married | Unknown | Blue | 53 | 6 | 2 | 3 | 14434.0 | 1927 | 12507.0 | 2.675 | 1731 | 32 | 3.571 | 0.134 |
| 805 | Existing Customer | 29 | M | 0 | Post-Graduate | Single | $40K - $60K | Blue | 19 | 4 | 1 | 2 | 13632.0 | 1482 | 12150.0 | 0.893 | 2168 | 42 | 2.500 | 0.109 |
| 968 | Existing Customer | 54 | M | 2 | Graduate | Married | $60K - $80K | Blue | 42 | 6 | 2 | 0 | 3583.0 | 0 | 3583.0 | 1.088 | 1065 | 28 | 1.545 | 0.000 |
| 1041 | Existing Customer | 36 | F | 3 | High School | Married | Less than $40K | Blue | 26 | 6 | 1 | 4 | 4331.0 | 1445 | 2886.0 | 1.070 | 2053 | 33 | 1.750 | 0.334 |
| 1088 | Attrited Customer | 35 | M | 3 | College | Married | $80K - $120K | Blue | 27 | 3 | 2 | 3 | 3575.0 | 1884 | 1691.0 | 0.384 | 861 | 15 | 1.500 | 0.527 |
| 1095 | Existing Customer | 54 | F | 3 | Doctorate | Married | $40K - $60K | Blue | 39 | 3 | 1 | 3 | 2112.0 | 1594 | 518.0 | 1.204 | 1338 | 29 | 2.222 | 0.755 |
| 1153 | Existing Customer | 46 | F | 3 | High School | Married | Unknown | Blue | 36 | 5 | 2 | 2 | 4668.0 | 0 | 4668.0 | 0.890 | 1612 | 47 | 1.474 | 0.000 |
| 1187 | Existing Customer | 51 | F | 4 | Graduate | Single | Unknown | Blue | 34 | 3 | 1 | 2 | 11221.0 | 1279 | 9942.0 | 1.116 | 1564 | 45 | 1.647 | 0.114 |
| 1256 | Existing Customer | 43 | M | 2 | High School | Divorced | $80K - $120K | Blue | 33 | 4 | 3 | 2 | 3083.0 | 1898 | 1185.0 | 1.042 | 1448 | 30 | 2.000 | 0.616 |
| 1296 | Existing Customer | 39 | M | 3 | Unknown | Single | $40K - $60K | Blue | 24 | 3 | 1 | 4 | 4562.0 | 1721 | 2841.0 | 0.896 | 1896 | 50 | 1.500 | 0.377 |
| 1374 | Existing Customer | 61 | F | 0 | Unknown | Unknown | Unknown | Blue | 36 | 4 | 2 | 2 | 1801.0 | 0 | 1801.0 | 1.234 | 3193 | 62 | 1.480 | 0.000 |
| 1455 | Existing Customer | 39 | F | 2 | Doctorate | Married | Unknown | Blue | 36 | 5 | 2 | 4 | 8058.0 | 791 | 7267.0 | 1.787 | 2742 | 42 | 2.000 | 0.098 |
| 1479 | Existing Customer | 55 | F | 2 | Unknown | Married | Unknown | Blue | 36 | 5 | 5 | 4 | 3602.0 | 2013 | 1589.0 | 0.839 | 1935 | 47 | 1.474 | 0.559 |
| 1778 | Existing Customer | 38 | F | 2 | Uneducated | Married | Less than $40K | Blue | 28 | 3 | 3 | 3 | 7049.0 | 0 | 7049.0 | 0.988 | 1475 | 27 | 1.700 | 0.000 |
| 1972 | Attrited Customer | 58 | M | 0 | High School | Unknown | $40K - $60K | Blue | 42 | 3 | 2 | 3 | 1438.3 | 0 | 1438.3 | 0.835 | 1088 | 22 | 1.444 | 0.000 |
| 2076 | Existing Customer | 57 | M | 4 | Graduate | Married | $120K + | Blue | 51 | 4 | 3 | 1 | 10711.0 | 1970 | 8741.0 | 0.710 | 1180 | 29 | 1.417 | 0.184 |
| 2099 | Existing Customer | 65 | F | 1 | Graduate | Single | Unknown | Blue | 55 | 5 | 2 | 4 | 6463.0 | 1660 | 4803.0 | 0.995 | 2336 | 49 | 1.722 | 0.257 |
| 2358 | Existing Customer | 28 | M | 0 | High School | Married | Unknown | Blue | 21 | 5 | 3 | 5 | 25618.0 | 1209 | 24409.0 | 1.228 | 2157 | 49 | 1.882 | 0.047 |
| 2386 | Existing Customer | 58 | F | 2 | Unknown | Married | Less than $40K | Blue | 43 | 6 | 2 | 2 | 3034.0 | 2517 | 517.0 | 0.630 | 1617 | 29 | 1.417 | 0.830 |
| 2471 | Existing Customer | 54 | F | 1 | Graduate | Married | Less than $40K | Blue | 41 | 3 | 2 | 4 | 4658.0 | 588 | 4070.0 | 0.856 | 1351 | 30 | 1.500 | 0.126 |
| 2510 | Attrited Customer | 54 | M | 3 | High School | Married | $40K - $60K | Blue | 44 | 2 | 1 | 5 | 3032.0 | 0 | 3032.0 | 0.949 | 1037 | 14 | 2.500 | 0.000 |
| 2565 | Existing Customer | 39 | M | 3 | Graduate | Married | $120K + | Blue | 36 | 3 | 3 | 2 | 32964.0 | 2231 | 30733.0 | 1.731 | 3094 | 45 | 1.647 | 0.068 |
| 2683 | Existing Customer | 33 | F | 3 | Post-Graduate | Married | $40K - $60K | Blue | 28 | 5 | 2 | 3 | 1864.0 | 1081 | 783.0 | 1.416 | 2607 | 44 | 1.750 | 0.580 |
| 2696 | Existing Customer | 47 | F | 3 | Uneducated | Married | Less than $40K | Blue | 32 | 4 | 3 | 2 | 2821.0 | 951 | 1870.0 | 0.577 | 1452 | 38 | 1.923 | 0.337 |
| 2982 | Existing Customer | 45 | M | 4 | Unknown | Married | $40K - $60K | Blue | 30 | 6 | 3 | 3 | 1892.0 | 1049 | 843.0 | 1.122 | 4558 | 60 | 1.609 | 0.554 |
| 3221 | Existing Customer | 35 | M | 3 | High School | Unknown | $80K - $120K | Blue | 30 | 5 | 3 | 3 | 11229.0 | 1054 | 10175.0 | 0.552 | 1898 | 36 | 1.571 | 0.094 |
| 3270 | Existing Customer | 49 | M | 3 | High School | Unknown | $60K - $80K | Blue | 36 | 3 | 2 | 2 | 9551.0 | 1833 | 7718.0 | 1.675 | 3213 | 52 | 1.476 | 0.192 |
| 3330 | Existing Customer | 49 | M | 4 | College | Divorced | $40K - $60K | Blue | 43 | 3 | 2 | 3 | 1922.0 | 710 | 1212.0 | 0.951 | 2920 | 46 | 1.421 | 0.369 |
| 3430 | Existing Customer | 42 | M | 5 | Graduate | Married | $60K - $80K | Blue | 27 | 3 | 2 | 3 | 11880.0 | 2025 | 9855.0 | 0.835 | 1789 | 42 | 1.471 | 0.170 |
| 4086 | Existing Customer | 50 | M | 3 | College | Single | $40K - $60K | Blue | 36 | 3 | 2 | 2 | 3069.0 | 1881 | 1188.0 | 1.111 | 2909 | 58 | 1.522 | 0.613 |
| 4146 | Existing Customer | 48 | M | 4 | Doctorate | Unknown | $80K - $120K | Blue | 40 | 3 | 2 | 2 | 3429.0 | 1268 | 2161.0 | 0.703 | 4422 | 64 | 1.783 | 0.370 |
| 4597 | Existing Customer | 39 | M | 2 | Graduate | Married | $60K - $80K | Blue | 23 | 5 | 2 | 1 | 4761.0 | 1561 | 3200.0 | 0.690 | 2045 | 53 | 1.650 | 0.328 |
| 5260 | Existing Customer | 54 | F | 2 | Unknown | Single | Less than $40K | Blue | 47 | 3 | 2 | 3 | 2341.0 | 1550 | 791.0 | 0.656 | 3406 | 57 | 1.478 | 0.662 |
| 5762 | Existing Customer | 42 | F | 2 | High School | Married | Less than $40K | Blue | 31 | 3 | 5 | 2 | 3964.0 | 1218 | 2746.0 | 0.593 | 4580 | 75 | 1.419 | 0.307 |
| 6566 | Existing Customer | 43 | F | 3 | Graduate | Single | $40K - $60K | Blue | 33 | 3 | 2 | 2 | 2900.0 | 2122 | 778.0 | 0.631 | 4656 | 59 | 1.458 | 0.732 |
| 7134 | Existing Customer | 46 | M | 3 | High School | Married | $60K - $80K | Blue | 36 | 4 | 3 | 1 | 4696.0 | 0 | 4696.0 | 0.732 | 4828 | 62 | 1.583 | 0.000 |
| 7408 | Existing Customer | 50 | F | 3 | Graduate | Married | Less than $40K | Blue | 40 | 3 | 2 | 2 | 2925.0 | 2065 | 860.0 | 0.692 | 3913 | 47 | 1.474 | 0.706 |
| 8393 | Existing Customer | 44 | F | 2 | Graduate | Unknown | Less than $40K | Blue | 39 | 2 | 3 | 3 | 4057.0 | 755 | 3302.0 | 0.873 | 4301 | 58 | 1.417 | 0.186 |
| 9977 | Attrited Customer | 41 | F | 3 | Uneducated | Married | Less than $40K | Blue | 30 | 1 | 2 | 3 | 9491.0 | 0 | 9491.0 | 1.056 | 9061 | 51 | 1.684 | 0.000 |
Observations and Insights - Outlier Analysis
# TEMP USE BACKUP FOR CODING ONLY
df_eda1 = df_eda.copy()
In order to conduct a more accurate bivariate analysis, the y-target variable will be converted from a string categorical data type into an integer data type, so that it can be used in correlation and covariance calculations.
The analysis will then focus on attributes in relation to the Attrition_Flag
# replace the string labels with binary values, then convert to int64 data type
df_eda['Attrition_Flag'] = df_eda['Attrition_Flag'].str.replace('Existing Customer', '1')
df_eda['Attrition_Flag'] = df_eda['Attrition_Flag'].str.replace('Attrited Customer', '0')
df_eda['Attrition_Flag'] = df_eda.Attrition_Flag.astype(np.int64)
print(df_eda['Attrition_Flag'].value_counts(ascending = False, dropna=False))
1    8500
0    1627
Name: Attrition_Flag, dtype: int64
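An equivalent, arguably safer way to do this encoding (a minimal sketch, assuming only these two label values occur) is `Series.map`, which turns any unexpected label into NaN instead of silently leaving it as a string:

```python
import pandas as pd

# Minimal sketch: map the two known labels directly; an unexpected label
# would become NaN and fail the astype() cast instead of slipping through.
flags = pd.Series(["Existing Customer", "Attrited Customer", "Existing Customer"])
encoded = flags.map({"Existing Customer": 1, "Attrited Customer": 0}).astype("int64")
print(encoded.tolist())  # → [1, 0, 1]
```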
# # Pairplot for checking covariances - Source: PF-W3-3.13 / EDA Univariate and Multivariate Analysis
# # df_eda.cov() # Print a covariance table displaying the covariance of each attribute against every other attribute
# # Using scatter plots for every pair of attributes and histograms along the diagonal
# # Pairplot function for an initial high-level analysis of anything standing out in the dataset
sns.pairplot(df_eda, hue="Attrition_Flag") # plots scatter plots for every pair of attributes and histograms along the diagonal
# kind="reg" would add a regression line through each scatter plot
plt.show()
Observations and Insights - Pairplot
The strongly related pairs visible above may indicate that at least one attribute from each pair can be dropped from the models
round(df_eda.cov(),2) # Print a covariance table displaying the covariance of each attribute against every other attribute
| Attrition_Flag | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition_Flag | 0.13 | -0.05 | -0.01 | -0.04 | 0.09 | -0.06 | -0.08 | 79.68 | 78.73 | 0.95 | 0.01 | 210.33 | 3.20 | 0.03 | 0.02 |
| Customer_Age | -0.05 | 64.27 | -1.27 | 50.51 | -0.14 | 0.44 | -0.16 | 180.43 | 96.57 | 83.86 | -0.11 | -1264.93 | -12.63 | -0.02 | 0.02 |
| Dependent_count | -0.01 | -1.27 | 1.69 | -1.07 | -0.08 | -0.01 | -0.06 | 803.54 | -2.85 | 806.38 | -0.01 | 110.52 | 1.52 | 0.00 | -0.01 |
| Months_on_book | -0.04 | 50.51 | -1.07 | 63.78 | -0.11 | 0.60 | -0.10 | 544.91 | 56.12 | 488.79 | -0.09 | -1047.00 | -9.34 | -0.03 | -0.02 |
| Total_Relationship_Count | 0.09 | -0.14 | -0.08 | -0.11 | 2.42 | -0.01 | 0.09 | -1008.51 | 17.39 | -1025.90 | 0.02 | -1833.55 | -8.83 | 0.02 | 0.03 |
| Months_Inactive_12_mon | -0.06 | 0.44 | -0.01 | 0.60 | -0.01 | 1.02 | 0.03 | -187.32 | -34.77 | -152.56 | -0.01 | -126.97 | -1.01 | -0.01 | -0.00 |
| Contacts_Count_12_mon | -0.08 | -0.16 | -0.06 | -0.10 | 0.09 | 0.03 | 1.22 | 209.30 | -48.61 | 257.90 | -0.01 | -423.80 | -3.95 | -0.03 | -0.02 |
| Credit_Limit | 79.68 | 180.43 | 803.54 | 544.91 | -1008.51 | -187.32 | 209.30 | 82605861.00 | 314752.85 | 82291108.15 | 25.53 | 5302297.03 | 16198.02 | -4.37 | -1210.16 |
| Total_Revolving_Bal | 78.73 | 96.57 | -2.85 | 56.12 | 17.39 | -34.77 | -48.61 | 314752.85 | 664204.36 | -349451.51 | 10.39 | 178217.22 | 1072.43 | 17.44 | 140.21 |
| Avg_Open_To_Buy | 0.95 | 83.86 | 806.38 | 488.79 | -1025.90 | -152.56 | 257.90 | 82291108.15 | -349451.51 | 82640559.65 | 15.13 | 5124079.81 | 15125.59 | -21.81 | -1350.37 |
| Total_Amt_Chng_Q4_Q1 | 0.01 | -0.11 | -0.01 | -0.09 | 0.02 | -0.01 | -0.01 | 25.53 | 10.39 | 15.13 | 0.05 | 29.55 | 0.03 | 0.02 | 0.00 |
| Total_Trans_Amt | 210.33 | -1264.93 | 110.52 | -1047.00 | -1833.55 | -126.97 | -423.80 | 5302297.03 | 178217.22 | 5124079.81 | 29.55 | 11540487.17 | 64364.97 | 69.22 | -77.77 |
| Total_Trans_Ct | 3.20 | -12.63 | 1.52 | -9.34 | -8.83 | -1.01 | -3.95 | 16198.02 | 1072.43 | 15125.59 | 0.03 | 64364.97 | 550.96 | 0.63 | 0.02 |
| Total_Ct_Chng_Q4_Q1 | 0.03 | -0.02 | 0.00 | -0.03 | 0.02 | -0.01 | -0.03 | -4.37 | 17.44 | -21.81 | 0.02 | 69.22 | 0.63 | 0.06 | 0.00 |
| Avg_Utilization_Ratio | 0.02 | 0.02 | -0.01 | -0.02 | 0.03 | -0.00 | -0.02 | -1210.16 | 140.21 | -1350.37 | 0.00 | -77.77 | 0.02 | 0.00 | 0.08 |
Observations and Insights - Covariance Table
# Plotting Correlation Heatmap to view relation to 'Attrition_Flag' - SL-W2-2.6
sns.set(rc={'figure.figsize':(16,10)}) # set figure size
sns.heatmap(df_eda.corr(),
annot=True,
linewidths=.5,
center=0,
cbar=False,
cmap="Spectral") # use of heatmap function
plt.show() # plot map
# Using the correlation function to calculate correlation values between the variables
df_eda.corr()['Attrition_Flag'].sort_values(ascending = False)
Attrition_Flag              1.000000
Total_Trans_Ct              0.371403
Total_Ct_Chng_Q4_Q1         0.290054
Total_Revolving_Bal         0.263053
Avg_Utilization_Ratio       0.178410
Total_Trans_Amt             0.168598
Total_Relationship_Count    0.150005
Total_Amt_Chng_Q4_Q1        0.131063
Credit_Limit                0.023873
Avg_Open_To_Buy             0.000285
Months_on_book             -0.013687
Customer_Age               -0.018203
Dependent_count            -0.018991
Months_Inactive_12_mon     -0.152449
Contacts_Count_12_mon      -0.204491
Name: Attrition_Flag, dtype: float64
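The heatmap values are simply the covariance-table entries rescaled: corr(x, y) = cov(x, y) / (std(x) · std(y)). A toy check of that identity (names and data here are illustrative only, not the project's dataset):

```python
import numpy as np
import pandas as pd

# Toy check that pandas' corr() equals cov() scaled by the standard
# deviations, i.e. the link between the covariance table and the heatmap.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=100)})
toy["y"] = 2 * toy["x"] + rng.normal(size=100)
cov_xy = toy.cov().loc["x", "y"]
corr_xy = toy.corr().loc["x", "y"]
assert abs(corr_xy - cov_xy / (toy["x"].std() * toy["y"].std())) < 1e-12
```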
Observations and Insights - Correlation Heatmap in Relation to "Attrition_Flag"
In addition to examining the covariance and correlation calculations and plots, we will examine, mostly using boxplots, other dataset attributes plotted directly against the Attrition_Flag attribute
sns.set(rc={"figure.figsize": (5, 3)})
sns.boxplot(x="Attrition_Flag", y="Customer_Age", data=df_eda, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Customer_Age'>
sns.set(rc={"figure.figsize": (5, 3)})
sns.boxplot(x="Attrition_Flag", y="Months_on_book", data=df_eda, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Months_on_book'>
sns.set(rc={"figure.figsize": (5, 3)})
sns.boxplot(x="Attrition_Flag", y="Months_Inactive_12_mon", data=df_eda, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Months_Inactive_12_mon'>
sns.set(rc={"figure.figsize": (5, 3)})
sns.boxplot(x="Attrition_Flag", y="Contacts_Count_12_mon", data=df_eda, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Contacts_Count_12_mon'>
sns.set(rc={"figure.figsize": (5, 3)})
sns.boxplot(x="Attrition_Flag", y="Credit_Limit", data=df_eda, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Credit_Limit'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Total_Revolving_Bal", hue="Card_Category", data=df_eda, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Total_Revolving_Bal'>
# sns bar plot
sns.barplot(x=df_eda["Attrition_Flag"], y=df_eda["Total_Revolving_Bal"], palette="PuBu")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Total_Revolving_Bal'>
The bar plot above shows that attrited customers (0) carry a noticeably lower mean Total_Revolving_Bal (around 700) than existing customers (above 1200), suggesting that customers with low revolving balances are more likely to leave Thera Bank's credit card services
sns.set(rc={"figure.figsize": (5, 3)})
sns.boxplot(x="Attrition_Flag", y="Total_Trans_Amt", data=df_eda, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Total_Trans_Amt'>
# sns bar plot
sns.barplot(x=df_eda["Attrition_Flag"], y=df_eda["Total_Trans_Amt"], palette="PuBu")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Total_Trans_Amt'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Total_Trans_Ct", hue="Card_Category", data=df_eda, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Total_Trans_Ct'>
sns.set(rc={"figure.figsize": (5, 3)})
sns.boxplot(x="Attrition_Flag", y="Total_Ct_Chng_Q4_Q1", data=df_eda, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Total_Ct_Chng_Q4_Q1'>
# sns bar plot
sns.barplot(x=df_eda["Attrition_Flag"], y=df_eda["Total_Ct_Chng_Q4_Q1"], palette="PuBu")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Total_Ct_Chng_Q4_Q1'>
sns.set(rc={"figure.figsize": (5, 3)})
sns.boxplot(x="Attrition_Flag", y="Avg_Utilization_Ratio", data=df_eda, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Avg_Utilization_Ratio'>
# sns bar plot
sns.barplot(x=df_eda["Attrition_Flag"], y=df_eda["Avg_Utilization_Ratio"], palette="PuBu")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Avg_Utilization_Ratio'>
# making a copy of the dataset to create a modeling dataset to work with
df_mod = df.copy()
Since the objective of the modeling is to help the bank improve its services so that existing customers do not renounce their credit cards, we will predict the customers who will not renounce their credit cards. As a result, we will drop the columns that would no longer be available at the time of prediction for new data.
The attributes "Months_on_book" and "Avg_Open_To_Buy" will be dropped based upon their strong positive correlations with the "Customer_Age" and "Credit_Limit" attributes respectively
The attributes "Months_Inactive_12_mon", "Contacts_Count_12_mon", "Total_Amt_Chng_Q4_Q1", "Total_Trans_Amt", "Total_Trans_Ct", "Total_Ct_Chng_Q4_Q1" and "Avg_Utilization_Ratio" are attributes calculated from past data and will not exist for new customers at prediction time
# Dropping redundant or non-applicable attributes from the models
df_mod.drop(
columns=[
"Months_on_book",
"Avg_Open_To_Buy",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Total_Amt_Chng_Q4_Q1",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
], inplace=True)
Changing the "Unknown" string value into a missing data value
# converting string value into missing data value
df_mod[['Income_Category']] = df_mod[['Income_Category']].replace("Unknown", np.nan)
print("Number of Income_Category missing values: ", df_mod.Income_Category.isnull().sum())
Number of Income_Category missing values: 1112
First we will separate the dataset into the dependent target variable and the independent variables
X = df_mod.drop(["Attrition_Flag"], axis=1)
y = df_mod["Attrition_Flag"]
y.value_counts(1)
Existing Customer    0.83934
Attrited Customer    0.16066
Name: Attrition_Flag, dtype: float64
from sklearn.preprocessing import LabelEncoder
# Converting target variable into numerical form - MT-W2-2.10
le = LabelEncoder()
y = le.fit_transform(y)
# Existing Customer and Attrited Customer are replaced by 1 and 0 respectively
le.transform(['Attrited Customer','Existing Customer'])
array([0, 1])
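The reason for this particular assignment is that LabelEncoder sorts the class labels alphabetically before assigning integer codes, so 'Attrited Customer' comes first and maps to 0; a quick illustration:

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the classes alphabetically before assigning codes,
# so 'Attrited Customer' -> 0 and 'Existing Customer' -> 1.
le_demo = LabelEncoder()
codes = le_demo.fit_transform(["Existing Customer", "Attrited Customer"])
print(list(le_demo.classes_))  # → ['Attrited Customer', 'Existing Customer']
print(codes.tolist())          # → [1, 0]
```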
We will apply an overall split of the data into a ratio of 60:20:20 (train : validation : test)
# Splitting data into training, validation and test sets:
from sklearn.model_selection import train_test_split # split function used below
# first we split data into 2 parts: temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we'll split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 10) (2026, 10) (2026, 10)
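The printed shapes follow from the two-stage split: 20% of the rows are held out for test, then 25% of the remaining 80% (i.e. another 20% of the whole) becomes validation. A quick arithmetic check, assuming the full dataset has 10,127 rows and that `train_test_split` rounds the test share up (as scikit-learn does):

```python
import math

# 60:20:20 arithmetic: test = ceil(20% of n); val = ceil(25% of the remainder).
n = 10127  # total rows in the dataset
n_test = math.ceil(n * 0.20)
n_val = math.ceil((n - n_test) * 0.25)
n_train = n - n_test - n_val
print(n_train, n_val, n_test)  # → 6075 2026 2026
```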
We will impute missing values in the Education_Level, Marital_Status and Income_Category attributes using the SimpleImputer function with the 'most_frequent' strategy, which fills each column with its most frequent value.
# replace missing values with each column's most frequent value (mode)
si = SimpleImputer(strategy='most_frequent')
mode_imputed_col=['Education_Level','Marital_Status','Income_Category']
#Fit and transform the training data
X_train[mode_imputed_col]=si.fit_transform(X_train[mode_imputed_col])
#Transform the validation data using the modes learned from the training data
X_val[mode_imputed_col]=si.transform(X_val[mode_imputed_col])
#Transform the test data
X_test[mode_imputed_col]=si.transform(X_test[mode_imputed_col])
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Total_Relationship_Count    0
Credit_Limit                0
Total_Revolving_Bal         0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Total_Relationship_Count    0
Credit_Limit                0
Total_Revolving_Bal         0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Total_Relationship_Count    0
Credit_Limit                0
Total_Revolving_Bal         0
dtype: int64
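As a small self-contained illustration of the 'most_frequent' strategy (toy data, not the project's dataset): the imputer learns each column's mode during fit and fills NaNs with it during transform.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# 'Graduate' is the mode of the toy column, so the NaN is filled with it.
toy_train = pd.DataFrame({"Education_Level": ["Graduate", "Graduate", np.nan, "Doctorate"]})
imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy_train)
print(filled.ravel().tolist())  # → ['Graduate', 'Graduate', 'Graduate', 'Doctorate']
```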
# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6075 entries, 9501 to 703
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Customer_Age                    6075 non-null   int64
 1   Dependent_count                 6075 non-null   int64
 2   Total_Relationship_Count        6075 non-null   int64
 3   Credit_Limit                    6075 non-null   float64
 4   Total_Revolving_Bal             6075 non-null   int64
 5   Gender_M                        6075 non-null   uint8
 6   Education_Level_Doctorate       6075 non-null   uint8
 7   Education_Level_Graduate        6075 non-null   uint8
 8   Education_Level_High School     6075 non-null   uint8
 9   Education_Level_Post-Graduate   6075 non-null   uint8
 10  Education_Level_Uneducated      6075 non-null   uint8
 11  Marital_Status_Married          6075 non-null   uint8
 12  Marital_Status_Single           6075 non-null   uint8
 13  Income_Category_$40K - $60K     6075 non-null   uint8
 14  Income_Category_$60K - $80K     6075 non-null   uint8
 15  Income_Category_$80K - $120K    6075 non-null   uint8
 16  Income_Category_Less than $40K  6075 non-null   uint8
 17  Card_Category_Gold              6075 non-null   uint8
 18  Card_Category_Platinum          6075 non-null   uint8
 19  Card_Category_Silver            6075 non-null   uint8
dtypes: float64(1), int64(4), uint8(15)
memory usage: 373.8 KB
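One caveat worth noting: calling `get_dummies` separately on each split can yield mismatched columns if a rare category is absent from validation or test. A common safeguard, sketched here on hypothetical data (not a step in the original notebook), is to reindex the other splits to the training columns:

```python
import pandas as pd

# 'Platinum' appears only in train, so val would otherwise lack its dummy
# column; reindexing to the training columns adds it back filled with zeros.
train = pd.DataFrame({"Card": ["Blue", "Gold", "Platinum"]})
val = pd.DataFrame({"Card": ["Blue", "Gold"]})
train_d = pd.get_dummies(train, drop_first=True)
val_d = pd.get_dummies(val, drop_first=True).reindex(columns=train_d.columns, fill_value=0)
print(list(val_d.columns))  # → ['Card_Gold', 'Card_Platinum']
```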
Based upon the objective to build a classification model to predict if the customer is going to churn or not, the structure of the model predictions will be as follows:
# importing the model classes and evaluation utilities used below
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
from xgboost import XGBClassifier # assumes the xgboost package is installed
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("logreg", LogisticRegression(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
score = []
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
score.append(scores)
print("{}: {}".format(name, scores))
Cross-Validation Performance:

logreg: 99.43137254901961
Random forest: 96.9601493197868
Bagging: 90.91974061459716
GBM: 97.11678115799802
Adaboost: 96.07733456483673
Xgboost: 94.35169043083377

Validation Performance:

logreg: 0.9911764705882353
Random forest: 0.9694117647058823
Bagging: 0.9188235294117647
GBM: 0.9705882352941176
Adaboost: 0.9605882352941176
Xgboost: 0.9505882352941176
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# defining a function to compute different metrics to check performance of sklearn model built
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
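The two helpers tie together as follows: the metrics computed in `model_performance_classification_sklearn` can be read directly off the confusion matrix that `confusion_matrix_sklearn` plots. A toy check of those identities:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# recall = TP / (TP + FN), precision = TP / (TP + FP); sklearn's 2x2
# confusion matrix flattens to (tn, fp, fn, tp).
y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1])
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
assert recall_score(y_true, y_pred) == tp / (tp + fn)
assert precision_score(y_true, y_pred) == tp / (tp + fp)
print(tp, fp, fn, tn)  # → 3 1 1 1
```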
lr = LogisticRegression(random_state=1)
lr.fit(X_train, y_train)
LogisticRegression(random_state=1)
log_reg_model_train_perf = model_performance_classification_sklearn(
lr, X_train, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.840329 | 0.99294 | 0.844255 | 0.912581 |
scoring = "recall"
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_result_bfr = cross_val_score(estimator=lr, X=X_train, y=y_train, scoring=scoring, cv=kfold)
cv_result_bfr
array([0.99117647, 0.99411765, 0.99411765, 0.99215686, 1. ])
log_reg_model_val_perf = model_performance_classification_sklearn(lr, X_val, y_val)
print("Validation performance:")
log_reg_model_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.839585 | 0.991176 | 0.844612 | 0.912043 |
confusion_matrix_sklearn(lr, X_val, y_val)
Observations
#Fitting the Random Forest Classifier model
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train,y_train)
#Calculating different metrics
rf_estimator_model_train_perf=model_performance_classification_sklearn(rf_estimator,X_train,y_train)
print("Training performance:\n",rf_estimator_model_train_perf)
rf_estimator_model_val_perf=model_performance_classification_sklearn(rf_estimator,X_val, y_val)
print("Validation performance:\n",rf_estimator_model_val_perf)
#Creating confusion matrix
confusion_matrix_sklearn(rf_estimator,X_val,y_val)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Validation performance:
Accuracy Recall Precision F1
0 0.84304 0.969412 0.861024 0.912009
Observations
#Fitting the Bagging Classifier model
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(X_train,y_train)
#Calculating different metrics
bagging_classifier_model_train_perf=model_performance_classification_sklearn(bagging_classifier,X_train,y_train)
print("Training performance:\n",bagging_classifier_model_train_perf)
bagging_classifier_model_val_perf=model_performance_classification_sklearn(bagging_classifier,X_val,y_val)
print("Validation performance:\n",bagging_classifier_model_val_perf)
#Creating confusion matrix
confusion_matrix_sklearn(bagging_classifier,X_val,y_val)
Training performance:
Accuracy Recall Precision F1
0 0.989465 0.995097 0.992372 0.993733
Validation performance:
Accuracy Recall Precision F1
0 0.825271 0.918824 0.878515 0.898217
Observations
#Fitting the AdaBoost Classifier model
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train,y_train)
#Calculating different metrics
ab_classifier_model_train_perf=model_performance_classification_sklearn(ab_classifier,X_train,y_train)
print("Training performance:\n",ab_classifier_model_train_perf)
ab_classifier_model_val_perf=model_performance_classification_sklearn(ab_classifier,X_val,y_val)
print("Validation performance:\n",ab_classifier_model_val_perf)
#Creating confusion matrix
confusion_matrix_sklearn(ab_classifier,X_val,y_val)
Training performance:
Accuracy Recall Precision F1
0 0.854815 0.964503 0.875245 0.917709
Validation performance:
Accuracy Recall Precision F1
0 0.844028 0.960588 0.867694 0.911781
Observations
#Fitting the Gradient Boosting Classifier model
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)
#Calculating different metrics
gb_classifier_model_train_perf=model_performance_classification_sklearn(gb_classifier,X_train,y_train)
print("Training performance:\n",gb_classifier_model_train_perf)
gb_classifier_model_val_perf=model_performance_classification_sklearn(gb_classifier,X_val,y_val)
print("Validation performance:\n",gb_classifier_model_val_perf)
#Creating confusion matrix
confusion_matrix_sklearn(gb_classifier,X_val,y_val)
Training performance:
Accuracy Recall Precision F1
0 0.87144 0.979408 0.880776 0.927477
Validation performance:
Accuracy Recall Precision F1
0 0.850938 0.970588 0.867508 0.916158
Observations
#Fitting the XGBoost Classifier model
xgb_classifier = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_classifier.fit(X_train,y_train)
#Calculating different metrics
xgb_classifier_model_train_perf=model_performance_classification_sklearn(xgb_classifier,X_train,y_train)
print("Training performance:\n",xgb_classifier_model_train_perf)
xgb_classifier_model_val_perf=model_performance_classification_sklearn(xgb_classifier,X_val,y_val)
print("Validation performance:\n",xgb_classifier_model_val_perf)
#Creating confusion matrix
confusion_matrix_sklearn(xgb_classifier,X_val,y_val)
Training performance:
Accuracy Recall Precision F1
0 0.966584 0.99647 0.964869 0.980415
Validation performance:
Accuracy Recall Precision F1
0 0.838598 0.950588 0.869285 0.90812
Observations
# Model Validation performance comparison
models_val_comp_df = pd.concat(
[
# log_reg_model_val_perf.T,
# rf_estimator_model_val_perf.T,
bagging_classifier_model_val_perf.T,
ab_classifier_model_val_perf.T,
gb_classifier_model_val_perf.T,
xgb_classifier_model_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
# "Logistic Regression"
# "Random Forest Classifier"
"Bagging_Classifier",
"AdaBoost Classifier",
"GB Classifier",
"XGB Classifier",
]
print("Validation Performance Comparison:")
models_val_comp_df
Validation Performance Comparison:
| Bagging_Classifier | AdaBoost Classifier | GB Classifier | XGB Classifier | |
|---|---|---|---|---|
| Accuracy | 0.825271 | 0.844028 | 0.850938 | 0.838598 |
| Recall | 0.918824 | 0.960588 | 0.970588 | 0.950588 |
| Precision | 0.878515 | 0.867694 | 0.867508 | 0.869285 |
| F1 | 0.898217 | 0.911781 | 0.916158 | 0.908120 |
print("Logistic Regression Validation Results", )
display(log_reg_model_val_perf.T)
print("Random Forest Classifier Validation Results", )
display(rf_estimator_model_val_perf.T)
Logistic Regression Validation Results
| 0 | |
|---|---|
| Accuracy | 0.839585 |
| Recall | 0.991176 |
| Precision | 0.844612 |
| F1 | 0.912043 |
Random Forest Classifier Validation Results
| 0 | |
|---|---|
| Accuracy | 0.843040 |
| Recall | 0.969412 |
| Precision | 0.861024 |
| F1 | 0.912009 |
Model Observations
Next we will examine the training set results when oversampling with SMOTE is applied to the training dataset
print("Before UpSampling, counts of label 'Existing': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'Attrited': {} \n".format(sum(y_train == 0)))
from imblearn.over_sampling import SMOTE # Synthetic Minority Over Sampling Technique
sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1
)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Existing': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'Attrited': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Existing': 5099
Before UpSampling, counts of label 'Attrited': 976

After UpSampling, counts of label 'Existing': 5099
After UpSampling, counts of label 'Attrited': 5099

After UpSampling, the shape of train_X: (10198, 20)
After UpSampling, the shape of train_y: (10198,)
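SMOTE's core idea, hand-rolled on a toy pair of points (an assumed simplification for illustration, not the imblearn implementation): each synthetic minority sample is an interpolation between an existing minority sample and one of its k nearest minority-class neighbors.

```python
import numpy as np

# A synthetic point lies on the line segment between a minority sample and
# one of its minority-class neighbors, at a random fraction of the gap.
rng = np.random.default_rng(1)
x = np.array([1.0, 2.0])          # a minority sample
neighbor = np.array([3.0, 4.0])   # one of its k nearest minority neighbors
gap = rng.random()                # uniform fraction in [0, 1)
synthetic = x + gap * (neighbor - x)
assert np.all(x <= synthetic) and np.all(synthetic <= neighbor)
print(synthetic.shape)  # → (2,)
```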
log_reg_over = LogisticRegression(random_state=1)
# Training the basic logistic regression model with training set
log_reg_over.fit(X_train_over, y_train_over)
LogisticRegression(random_state=1)
# Calculating different metrics on train set
log_reg_over_train_perf = model_performance_classification_sklearn(
log_reg_over, X_train_over, y_train_over
)
print("Training performance:")
log_reg_over_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.771132 | 0.780349 | 0.766224 | 0.773222 |
# creating confusion matrix
confusion_matrix_sklearn(log_reg_over, X_val, y_val)
#Fitting the Random Forest Classifier model
rf_over_estimator = RandomForestClassifier(random_state=1)
rf_over_estimator.fit(X_train_over,y_train_over)
#Calculating different metrics
rf_over_estimator_model_train_perf=model_performance_classification_sklearn(rf_over_estimator,X_train_over,y_train_over)
print("Oversampled Training performance:\n",rf_over_estimator_model_train_perf)
#Creating confusion matrix
confusion_matrix_sklearn(rf_over_estimator,X_train_over,y_train_over)
Oversampled Training performance:
Accuracy Recall Precision F1
0 0.999804 0.999608 1.0 0.999804
#Fitting the Bagging Classifier model
bagging_over_classifier = BaggingClassifier(random_state=1)
bagging_over_classifier.fit(X_train_over,y_train_over)
#Calculating different metrics
bagging_over_classifier_model_train_perf=model_performance_classification_sklearn(bagging_over_classifier,X_train_over,y_train_over)
print("Oversampled Training performance:\n",bagging_over_classifier_model_train_perf)
#Creating confusion matrix
confusion_matrix_sklearn(bagging_over_classifier,X_train_over,y_train_over)
#Fitting the AdaBoost Classifier model
ab_over_classifier = AdaBoostClassifier(random_state=1)
ab_over_classifier.fit(X_train_over,y_train_over)
#Calculating different metrics
ab_over_classifier_model_train_perf = model_performance_classification_sklearn(ab_over_classifier, X_train_over, y_train_over)
print("Oversampled Training performance:\n",ab_over_classifier_model_train_perf)
#Creating confusion matrix
confusion_matrix_sklearn(ab_over_classifier,X_train_over,y_train_over)
Oversampled Training performance:
Accuracy Recall Precision F1
0 0.629241 0.964503 0.577366 0.722332
Observations
#Fitting the Gradient Boosting Classifier model
gb_over_classifier = GradientBoostingClassifier(random_state=1)
gb_over_classifier.fit(X_train_over,y_train_over)
#Calculating different metrics
gb_over_classifier_model_train_perf=model_performance_classification_sklearn(gb_over_classifier,X_train_over,y_train_over)
print("Training performance:\n",gb_over_classifier_model_train_perf)
#Creating confusion matrix
confusion_matrix_sklearn(gb_over_classifier,X_train_over,y_train_over)
Training performance:
Accuracy Recall Precision F1
0 0.876741 0.89645 0.862453 0.879123
Observations
#Fitting the XGBoost model
xgb_over_classifier = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_over_classifier.fit(X_train_over,y_train_over)
#Calculating different metrics
xgb_over_classifier_model_train_perf=model_performance_classification_sklearn(xgb_over_classifier,
X_train_over,y_train_over)
print("Oversampled Training performance:\n",xgb_over_classifier_model_train_perf)
#Creating confusion matrix
confusion_matrix_sklearn(xgb_over_classifier,X_train_over,y_train_over)
Oversampled Training performance:
Accuracy Recall Precision F1
0 0.97676 0.980584 0.973141 0.976849
Observations
# Upsampling performance comparison
upsampling_comp_df = pd.concat(
[
# log_reg_model_val_perf.T,
# rf_estimator_model_val_perf.T,
bagging_over_classifier_model_train_perf.T,
ab_over_classifier_model_train_perf.T,
gb_over_classifier_model_train_perf.T,
xgb_over_classifier_model_train_perf.T,
],
axis=1,
)
upsampling_comp_df.columns = [
# "Logistic Regression"
# "Random Forest Classifier"
"Bagging Upsampling",
"AdaBoost Upsampling",
"GB Upsamplingr",
"XGB Upsampling",
]
print("Upsampling Performance Comparison:")
upsampling_comp_df
Upsampling Performance Comparison:
| | Bagging Upsampling | AdaBoost Upsampling | GB Upsampling | XGB Upsampling |
|---|---|---|---|---|
| Accuracy | 0.794371 | 0.629241 | 0.876741 | 0.976760 |
| Recall | 0.995097 | 0.964503 | 0.896450 | 0.980584 |
| Precision | 0.710048 | 0.577366 | 0.862453 | 0.973141 |
| F1 | 0.828746 | 0.722332 | 0.879123 | 0.976849 |
print("Logistic Regression Upsampling Results", )
display(log_reg_over_train_perf.T)
print("Random Forest Classifier Upsampling Results", )
display(rf_over_estimator_model_train_perf.T)
Logistic Regression Upsampling Results
| | 0 |
|---|---|
| Accuracy | 0.771132 |
| Recall | 0.780349 |
| Precision | 0.766224 |
| F1 | 0.773222 |
Random Forest Classifier Upsampling Results
| | 0 |
|---|---|
| Accuracy | 0.999804 |
| Recall | 0.999608 |
| Precision | 1.000000 |
| F1 | 0.999804 |
OverSampling Observations
Next, we will examine the training-set results when undersampling with RandomUnderSampler is applied to the training dataset.
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 5099
Before Under Sampling, counts of label 'No': 976

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976

After Under Sampling, the shape of train_X: (1952, 20)
After Under Sampling, the shape of train_y: (1952,)
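RandomUnderSampler keeps every minority row and draws a random, equally sized subset of majority rows. A plain-NumPy sketch of the same idea, using the label counts printed above (this is an illustration, not the imblearn implementation):

```python
import numpy as np

rng = np.random.RandomState(1)
y = np.array([1] * 5099 + [0] * 976)  # same label counts as the training set
X = rng.rand(y.size, 20)

# Keep all minority rows and an equally sized random sample of majority rows
min_idx = np.flatnonzero(y == 0)
maj_idx = rng.choice(np.flatnonzero(y == 1), size=min_idx.size, replace=False)
keep = np.concatenate([maj_idx, min_idx])
X_un_demo, y_un_demo = X[keep], y[keep]
print(X_un_demo.shape, (y_un_demo == 1).sum(), (y_un_demo == 0).sum())  # (1952, 20) 976 976
```

The resulting shapes match the (1952, 20) output above, at the cost of discarding most majority-class rows.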
log_reg_under = LogisticRegression(random_state=1)
log_reg_under.fit(X_train_un, y_train_un)
LogisticRegression(random_state=1)
# Calculating different metrics on train set
log_reg_under_train_perf = model_performance_classification_sklearn(
log_reg_under, X_train_un, y_train_un
)
print("Undersampled Training performance:")
log_reg_under_train_perf
Undersampled Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.713627 | 0.711066 | 0.714727 | 0.712892 |
# creating confusion matrix
confusion_matrix_sklearn(log_reg_under, X_val, y_val)
Observations
#Fitting the model
rf_un_estimator = RandomForestClassifier(random_state=1)
rf_un_estimator.fit(X_train_un, y_train_un)
#Calculating different metrics
rf_un_estimator_model_train_perf=model_performance_classification_sklearn(rf_un_estimator,X_train_un, y_train_un)
print("Oversampled Training performance:\n",rf_un_estimator_model_train_perf)
#Creating confusion matrix
confusion_matrix_sklearn(rf_un_estimator,X_train_un,y_train_un)
Undersampled Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Observations
#Fitting the Bagging Classifier model
bagging_under_classifier = BaggingClassifier(random_state=1)
bagging_under_classifier.fit(X_train_un, y_train_un)
#Calculating different metrics
bagging_under_classifier_model_train_perf=model_performance_classification_sklearn(bagging_under_classifier,X_train_un, y_train_un)
print("Undersampled Training performance:\n",bagging_under_classifier_model_train_perf)
#Creating confusion matrix
confusion_matrix_sklearn(bagging_under_classifier,X_train_un, y_train_un)
Undersampled Training performance:
Accuracy Recall Precision F1
0 0.982582 0.970287 0.994748 0.982365
Observations
#Fitting the AdaBoost Classifier model
ab_under_classifier = AdaBoostClassifier(random_state=1)
ab_under_classifier.fit(X_train_un, y_train_un)
#Calculating different metrics
ab_under_classifier_model_train_perf = model_performance_classification_sklearn(ab_under_classifier, X_train_un, y_train_un)
print("Undersampled Training performance:\n", ab_under_classifier_model_train_perf)
#Creating confusion matrix
confusion_matrix_sklearn(ab_under_classifier,X_train_un, y_train_un)
Undersampled Training performance:
Accuracy Recall Precision F1
0 0.625 0.968238 0.574119 0.720824
Observations
#Fitting the Gradient Boosting Classifier model
gb_under_classifier = GradientBoostingClassifier(random_state=1)
gb_under_classifier.fit(X_train_un, y_train_un)
#Calculating different metrics
gb_under_classifier_model_train_perf=model_performance_classification_sklearn(gb_under_classifier,X_train_un, y_train_un)
print("Undersampled Training performance:\n",gb_under_classifier_model_train_perf)
#Creating confusion matrix
confusion_matrix_sklearn(gb_under_classifier,X_train_un, y_train_un)
Undersampled Training performance:
Accuracy Recall Precision F1
0 0.80584 0.771516 0.828383 0.798939
Observations
#Fitting the XGBoost model
xgb_under_classifier = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_under_classifier.fit(X_train_un, y_train_un)
#Calculating different metrics
xgb_under_classifier_model_train_perf = model_performance_classification_sklearn(xgb_under_classifier,
                                                                                 X_train_un, y_train_un)
print("Undersampled Training performance:\n", xgb_under_classifier_model_train_perf)
#Creating confusion matrix
confusion_matrix_sklearn(xgb_under_classifier,X_train_un, y_train_un)
Undersampled Training performance:
Accuracy Recall Precision F1
0 0.97676 0.980584 0.973141 0.976849
# Undersampling performance comparison
un_sampling_comp_df = pd.concat(
[
# log_reg_model_val_perf.T,
# rf_estimator_model_val_perf.T,
bagging_under_classifier_model_train_perf.T,
ab_under_classifier_model_train_perf.T,
gb_under_classifier_model_train_perf.T,
xgb_under_classifier_model_train_perf.T,
],
axis=1,
)
un_sampling_comp_df.columns = [
# "Logistic Regression"
# "Random Forest Classifier"
"Bagging Under Sampling",
"AdaBoost Under Sampling",
"GB Under Sampling",
"XGB Under Sampling",
]
print("Under Sampling Performance Comparison:")
un_sampling_comp_df
Under Sampling Performance Comparison:
| | Bagging Under Sampling | AdaBoost Under Sampling | GB Under Sampling | XGB Under Sampling |
|---|---|---|---|---|
| Accuracy | 0.982582 | 0.625000 | 0.805840 | 0.929816 |
| Recall | 0.970287 | 0.968238 | 0.771516 | 0.986680 |
| Precision | 0.994748 | 0.574119 | 0.828383 | 0.885925 |
| F1 | 0.982365 | 0.720824 | 0.798939 | 0.933592 |
print("Logistic Regression Upsampling Results", )
display(log_reg_under_train_perf.T)
print("Random Forest Classifier Upsampling Results", )
display(rf_un_estimator_model_train_perf.T)
Logistic Regression Under Sampling Results
| | 0 |
|---|---|
| Accuracy | 0.713627 |
| Recall | 0.711066 |
| Precision | 0.714727 |
| F1 | 0.712892 |
Random Forest Classifier Under Sampling Results
| | 0 |
|---|---|
| Accuracy | 1.0 |
| Recall | 1.0 |
| Precision | 1.0 |
| F1 | 1.0 |
UnderSampling Observations
Based upon the results from the first set of six models above, although recall is already above 0.95, we will attempt to tune the three models with the highest validation recall scores: Random Forest, GBC, and XGB.
Though already high, the validation scores were as follows:
# RandomForestClassifier Tuning
rf_tuned = RandomForestClassifier(class_weight={0:0.16,1:0.84},random_state=1,oob_score=True,bootstrap=True)
parameters = {
'max_depth': list(np.arange(5,30,5)) + [None],
'max_features': ['sqrt','log2',None],
'min_samples_leaf': np.arange(1,15,5),
'min_samples_split': np.arange(2, 20, 5),
    'n_estimators': np.arange(5,50,100)}  # NOTE: a step of 100 over (5, 50) yields only [5]; a smaller step (e.g. 10) would actually search this range
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5,n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.16, 1: 0.84}, max_depth=15,
max_features=None, min_samples_leaf=6,
min_samples_split=17, n_estimators=5, oob_score=True,
random_state=1)
#Calculating different metrics
rf_tuned_model_train_perf=model_performance_classification_sklearn(rf_tuned,X_train,y_train)
print("Training performance:\n",rf_tuned_model_train_perf)
rf_tuned_model_val_perf=model_performance_classification_sklearn(rf_tuned,X_val,y_val)
print("Validation performance:\n",rf_tuned_model_val_perf)
#Creating confusion matrix
confusion_matrix_sklearn(rf_tuned,X_val,y_val)
Training performance:
Accuracy Recall Precision F1
0 0.860412 0.998823 0.85813 0.923147
Validation performance:
Accuracy Recall Precision F1
0 0.84847 0.994706 0.850176 0.91678
Random Forest Classifier Tuning Observations
feature_names = X_val.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
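Alongside the bar chart, the importances can be inspected as a sorted table. A minimal sketch with hypothetical names and values standing in for `X_val.columns` and `rf_tuned.feature_importances_`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_val.columns and rf_tuned.feature_importances_
feature_names = ["Total_Trans_Ct", "Total_Trans_Amt", "Total_Revolving_Bal"]
importances = np.array([0.25, 0.20, 0.10])

# Rank features from most to least important
imp_df = (
    pd.DataFrame({"feature": feature_names, "importance": importances})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(imp_df.loc[0, "feature"])  # Total_Trans_Ct
```

In the notebook, substituting the real arrays gives a ranked table matching the plot above.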
Random Forest Classifier Feature Importances Observations
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [150,200,250,300],
"subsample":[0.7,0.8,0.9],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.9, n_estimators=150, random_state=1,
subsample=0.7)
#Calculating different metrics
gbc_tuned_model_train_perf=model_performance_classification_sklearn(gbc_tuned,X_train,y_train)
print("Training performance:\n",gbc_tuned_model_train_perf)
gbc_tuned_model_val_perf=model_performance_classification_sklearn(gbc_tuned,X_val,y_val)
print("Validation performance:\n",gbc_tuned_model_val_perf)
#Creating confusion matrix
confusion_matrix_sklearn(gbc_tuned,X_val,y_val)
Training performance:
Accuracy Recall Precision F1
0 0.877202 0.982349 0.884201 0.930695
Validation performance:
Accuracy Recall Precision F1
0 0.850444 0.968824 0.868213 0.915763
Observations - Gradient Boosting Tuning
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
#Let's try different max_depth for base_estimator
"base_estimator":[DecisionTreeClassifier(max_depth=1),DecisionTreeClassifier(max_depth=2),
DecisionTreeClassifier(max_depth=3)],
"n_estimators": np.arange(10,110,10),
"learning_rate":np.arange(0.1,2,0.1)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=0.2, n_estimators=10, random_state=1)
#Calculating different metrics
abc_tuned_model_train_perf=model_performance_classification_sklearn(abc_tuned,X_train,y_train)
print(abc_tuned_model_train_perf)
abc_tuned_model_val_perf=model_performance_classification_sklearn(abc_tuned,X_val,y_val)
print("AdaBoost Classifier Tuning Validation scores: ", abc_tuned_model_val_perf)
#Creating confusion matrix
confusion_matrix_sklearn(abc_tuned,X_val,y_val)
Accuracy Recall Precision F1
0 0.856132 0.976074 0.868738 0.919283

AdaBoost Classifier Tuning Validation scores:
Accuracy Recall Precision F1
0 0.852419 0.978824 0.863518 0.917563
Observations - AdaBoost Tuning
# Tuned Model Validation performance comparison
print("Random Forest Tuning Validation performance:\n",rf_tuned_model_val_perf)
print("GBC Tuning Validation performance:\n", gbc_tuned_model_val_perf)
print("AdaBoost Tuning Validation scores:\n", abc_tuned_model_val_perf)
Random Forest Tuning Validation performance:
Accuracy Recall Precision F1
0 0.84847 0.994706 0.850176 0.91678
GBC Tuning Validation performance:
Accuracy Recall Precision F1
0 0.850444 0.968824 0.868213 0.915763
AdaBoost Tuning Validation scores:
Accuracy Recall Precision F1
0 0.852419 0.978824 0.863518 0.917563
Observations
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 50, 'learning_rate': 0.01, 'base_estimator': DecisionTreeClassifier(max_depth=1, random_state=1)} with CV score=1.0:
Wall time: 57.9 s
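A CV recall of 1.0 deserves scrutiny here: in this training split the positive label (1) is the majority class (5099 of 6075 rows, per the resampling printouts earlier), so a classifier that simply predicts every row positive already achieves perfect recall. A minimal sketch of that degenerate case (class counts taken from the printouts; not the actual tuned model):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Class balance mirroring the training labels: positives are the majority
y_true = np.array([1] * 5099 + [0] * 976)
y_all_pos = np.ones_like(y_true)  # degenerate "always predict positive"

print(recall_score(y_true, y_all_pos))               # 1.0
print(round(precision_score(y_true, y_all_pos), 3))  # 0.839
```

Note that 0.839 is exactly the precision reported for the tuned models below, which is consistent with an always-positive classifier; scoring on F1 or balanced accuracy would guard against this.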
# building model with best parameters
adb_tuned2 = AdaBoostClassifier(
n_estimators=50,
learning_rate= 0.01,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=1, random_state=1),
)
# Fit the model on training data
adb_tuned2.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1,
random_state=1),
learning_rate=0.01, random_state=1)
# Calculating different metrics on train set
Adaboost_random_train = model_performance_classification_sklearn(
adb_tuned2, X_train, y_train
)
print("Training performance:")
Adaboost_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839342 | 1.0 | 0.839342 | 0.912654 |
# Calculating different metrics on validation set
Adaboost_random_val = model_performance_classification_sklearn(adb_tuned2, X_val, y_val)
print("Validation performance:")
Adaboost_random_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839092 | 1.0 | 0.839092 | 0.912507 |
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned2, X_val, y_val)
Observations
%%time
# defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,150,50),
'scale_pos_weight':[2,5,10],
'learning_rate':[0.01,0.1,0.2,0.05],
'gamma':[0,1,3,5],
'subsample':[0.8,0.9,1],
'max_depth':np.arange(1,5,1),
'reg_lambda':[5,10]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
xgb_tuned2 = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
xgb_tuned2.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(xgb_tuned2.best_params_,xgb_tuned2.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 2, 'reg_lambda': 5, 'n_estimators': 100, 'max_depth': 1, 'learning_rate': 0.1, 'gamma': 3} with CV score=1.0:
Wall time: 31.1 s
# building model with best parameters
xgb_tuned2 = XGBClassifier(
    random_state=1,
    n_estimators=100,
    scale_pos_weight=2,
    gamma=3,
    subsample=0.9,
    learning_rate=0.1,
    eval_metric="logloss",
    max_depth=1,
    reg_lambda=5,
)
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=3, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.1, max_delta_step=0,
              max_depth=1, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=4,
              num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=5,
              scale_pos_weight=2, subsample=0.9, tree_method='exact',
              validate_parameters=1, verbosity=None)
# Calculating different metrics on train set
xgboost_random_train = model_performance_classification_sklearn(
xgb_tuned2, X_train, y_train
)
print("Training performance:")
xgboost_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839342 | 1.0 | 0.839342 | 0.912654 |
# Calculating different metrics on validation set
xgboost_random_val = model_performance_classification_sklearn(xgb_tuned2, X_val, y_val)
print("Validation performance:")
xgboost_random_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839092 | 1.0 | 0.839092 | 0.912507 |
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned2, X_val, y_val)
Observation
# training performance comparison
models_train_comp_df = pd.concat(
[
Adaboost_random_train.T,
xgboost_random_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"AdaBoost Tuned with Random search",
"Xgboost Tuned with Random Search",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | AdaBoost Tuned with Random search | Xgboost Tuned with Random Search |
|---|---|---|
| Accuracy | 0.839342 | 0.839342 |
| Recall | 1.000000 | 1.000000 |
| Precision | 0.839342 | 0.839342 |
| F1 | 0.912654 | 0.912654 |
Observations
We will build a simple pipeline using the MinMaxScaler transformer and the SVC algorithm.
# pipeline with 1st stage being minmaxscaler function,
# 2nd (last) stage being the SVC algorithm and applying an object name to it
pipe = Pipeline([(" scaler", MinMaxScaler()), (" svm", SVC())])
# apply pipe.fit to X_train, y_train datasets,
pipe.fit( X_train, y_train)
Pipeline(steps=[('scaler', MinMaxScaler()), ('svm', SVC())])
# automatically calls the score function of the model
print(" Test score: {:.2f}". format( pipe.score( X_test, y_test)))
Test score: 0.84
The test score is produced by the final estimator in the pipeline (the SVC), not by the Pipeline wrapper itself.
# to call the predict (last step) of the pipe
y_pred = pipe.predict(X_test)
# output the classification report of class level scores from the model results
print(metrics.classification_report(y_test, y_pred))
precision recall f1-score support
0 0.00 0.00 0.00 325
1 0.84 1.00 0.91 1701
accuracy 0.84 2026
macro avg 0.42 0.50 0.46 2026
weighted avg 0.70 0.84 0.77 2026
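The classification report shows the SVC predicting only class 1 (recall 0.00 on class 0), i.e. it has collapsed to the majority class. One possible remedy, sketched on toy imbalanced data rather than the bank dataset, is to pass `class_weight="balanced"` to the SVC inside the same pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Toy imbalanced problem standing in for the bank data
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=1
)
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("svm", SVC(class_weight="balanced")),  # weight classes inversely to frequency
])
pipe.fit(X, y)
pred = pipe.predict(X)
print(np.unique(pred).size)  # both classes appear in the predictions
```

With balanced class weights the minority class is penalised more heavily when misclassified, so the model no longer defaults to predicting class 1 everywhere; the oversampling and undersampling strategies above address the same problem at the data level.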
Observation
Based upon the EDA insights, a suggested profile of customers leaving Thera Bank's credit card services would be: